radical-cybertools / radical.entk

The RADICAL Ensemble Toolkit
https://radical-cybertools.github.io/entk/index.html

ornl.summit does not seem to work from Summit login node #384

Closed mturilli closed 3 years ago

mturilli commented 4 years ago

I am opening this ticket for @wilkinson, based on a mail thread now closed.

Stack: Unknown. @wilkinson please run radical-stack and add the output to this ticket.

@wilkinson is trying to run the following script from Summit login node:

import os
from radical.entk import AppManager, Pipeline, Stage, Task

# Create objects.

appman = AppManager(
    hostname=os.environ["RMQ_HOSTNAME"],
    port=os.environ["RMQ_PORT"],
    username=os.environ["RMQ_USERNAME"],
    password=os.environ["RMQ_PASSWORD"])

p = Pipeline()

s = Stage()

t = Task()

# Use the objects to model the workflow.

appman.resource_desc = {
    'resource': 'ornl.summit',
    'walltime': 10,
    'cpus': 1
}

t.name = 'my-first-task'
t.executable = '/bin/echo'
t.arguments = ['Hello world!']

s.add_tasks(t)

p.add_stages(s)

appman.workflow = set([p])

# Execute the workflow.

appman.run()

@wilkinson please add to this ticket a C&P of the trace the script returns, if any.

Attached are the logs/session of the run: ornl-summit-demo.tar.gz

wilkinson commented 4 years ago

Output from radical-stack:

  python               : 2.7.15
  pythonpath           : /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/py-virtualenv-16.0.0-ohsvxc5mf4aornhyrfp4ecea5bzcowon/lib/python2.7/site-packages:/sw/summit/xalt/1.1.4/site:/sw/summit/xalt/1.1.4/libexec
  virtualenv           : /ccs/home/seanwilk/entk-with-passwords

  radical.entk         : 0.72.0-v0.70.0-60-gc73e031@devel
  radical.pilot        : 0.73.1
  radical.saga         : 0.72.1
  radical.utils        : 0.72.0

mturilli commented 4 years ago

Replication and debugging of the issue.

Setup:

module load python/2.7.15
module load py-virtualenv/16.0.0-py2
module load py-pip/10.0.1-py2
module load vim
module load py-setuptools/40.4.3-py2
export RADICAL_PILOT_DBURL="xxxx" 
export RADICAL_LOG_LVL="DEBUG"
export RADICAL_LOG_TGT="radical.log"
export RADICAL_PROFILE="TRUE"
export RMQ_HOSTNAME="xxxx"
export RMQ_PORT="xxxxx"
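
Since the run depends on several exported variables, a quick pre-flight check can save a failed submission. The following is a minimal sketch, not part of EnTK: `missing_env` and `REQUIRED_VARS` are hypothetical names, and the list of variables is simply taken from the exports and the script in this ticket.

```python
import os

# Variable names taken from the exports and the script in this ticket;
# treating them as "required" is an assumption of this sketch.
REQUIRED_VARS = ("RADICAL_PILOT_DBURL", "RMQ_HOSTNAME", "RMQ_PORT")


def missing_env(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]
```

Calling `missing_env()` before creating the AppManager returns an empty list when everything is exported, and otherwise the names that still need a value.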

Installation stack (note: pip install is broken; it might be a good idea to open a ticket with ORNL):

$ python -m pip install radical.entk
$ radical-stack 

  python               : 2.7.15
  pythonpath           : /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/py-setuptools-40.4.3-rc56sxgpafwvs5eyrvc3uxiaqoc6oe2f/lib/python2.7/site-packages:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/py-pip-10.0.1-2gr5x7tsnuxwissqhzapdbmlpheove3i/lib/python2.7/site-packages:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/py-virtualenv-16.0.0-phcok3x4eyd36qfh5ptv66isyol4ui4b/lib/python2.7/site-packages:/sw/summit/xalt/1.1.4/site:/sw/summit/xalt/1.1.4/libexec
  virtualenv           : /ccs/home/mturilli1/ve/entk_384

  radical.entk         : 0.72.1
  radical.pilot        : 0.73.1
  radical.saga         : 0.72.1
  radical.utils        : 0.72.0

Execution script:

$ python entk_384.py 
EnTK session: re.session.login5.mturilli1.018198.0000
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
                                                                              ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [ornl.summit:1]
                                                                              ok
wait for 1 pilot(s)
                                                                              ok
closing session re.session.login5.mturilli1.018198.0000                        \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 10.2s                                                       ok
All components terminated
Traceback (most recent call last):
  File "entk_384.py", line 44, in <module>
    appman.run()
  File "/ccs/home/mturilli1/ve/entk_384/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 373, in run
    'ended in state %s' % res_alloc_state)
radical.entk.exceptions.EnTKError: Cannot proceed. Resource ended in state FAILED

Debugging:

less radical.log
[...]
2019-10-29 18:18:22,206: radical.saga.cpi    : pmgr.0000.launching.0           : MainThread     : DEBUG   : run_sync: mkdir -p / && cd / &&  mkdir -p '/gpfs/alpine/scratch/mturilli1/bip178/radical.pilot.sandbox/re.session.login5.mturilli1.018198.0000/'
2019-10-29 18:18:22,206: radical.saga.pty    : pmgr.0000.launching.0           : MainThread     : DEBUG   : write: [  167] [  135] (mkdir -p / && cd / &&  mkdir -p '/gpfs/alpine/scratch/mturilli1/bip178/radical.pilot.sandbox/re.session.login5.mturilli1.018198.0000/'\n)
2019-10-29 18:18:22,244: radical.saga.pty    : pmgr.0000.launching.0           : MainThread     : DEBUG   : read : [  167] [   91] (mkdir: cannot create directory '/gpfs/alpine/scratch/mturilli1/bip178': Permission denied\n)
[...]

@wilkinson can you please try to replicate this and confirm that it fixes your issue?
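
The log excerpt above shows `mkdir -p` failing with "Permission denied" on the sandbox path. As a rough way to check a candidate sandbox location before submitting, here is a minimal sketch using only the Python standard library. `nearest_existing_dir` and `can_create` are hypothetical helpers, not RADICAL-Pilot functions, and `os.access` is only a heuristic (it does not account for quotas or every ACL setup).

```python
import os


def nearest_existing_dir(path):
    """Walk up from path until an existing directory is found."""
    path = os.path.abspath(path)
    while not os.path.isdir(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def can_create(path):
    """Heuristic: would `mkdir -p path` plausibly succeed for this user?"""
    base = nearest_existing_dir(path)
    # Creating an entry needs write and search permission on the parent.
    return os.access(base, os.W_OK | os.X_OK)
```

For the path in the log, `can_create` would be expected to return False when run as a user who cannot write to the directory above `bip178`.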

andre-merzky commented 4 years ago

Hey @mturilli , @wilkinson , thanks for diving into this!

The sandbox location issue should actually have been addressed by https://github.com/radical-cybertools/radical.pilot/pull/1922, or at least that PR should provide a framework for addressing it, by enabling a mechanism to refer to the job allocation in the sandbox setting. It seems we should adjust the config entries for Summit to use that mechanism?

As for radical-cybertools/radical.saga#472, one of us really should block some days before the next tutorial to track this down. This has eluded us for so long already... :-/ Let's discuss on the next devel call wrt. other priorities.

lee212 commented 4 years ago

With PR #385, AppManager now converts 'port' to an integer in __init__, which avoids, for example, a TypeError when the port is passed as a string.
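
To illustrate the kind of conversion described here, the following is a minimal sketch. It is not the code from PR #385; `coerce_port` is a hypothetical helper, and the range check is an extra assumption added for illustration.

```python
def coerce_port(port):
    """Return port as an int, or raise ValueError with a readable message."""
    try:
        value = int(port)
    except (TypeError, ValueError):
        raise ValueError("port must be an integer, got %r" % (port,))
    # Assumption: only valid TCP port numbers are accepted.
    if not 0 < value < 65536:
        raise ValueError("port out of range: %d" % value)
    return value
```

With a helper like this, a value read from `os.environ["RMQ_PORT"]` (always a string) and a literal integer are treated the same way.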

wilkinson commented 4 years ago

Hi @mturilli, thank you for looking at this so promptly, and my apologies for taking so long to test your solution. Unfortunately, the solution still does not work for me.

First, I exported the same environment variables for debugging that you showed above.

Next, I changed my demonstration script to the following:

import os
from radical.entk import AppManager, Pipeline, Stage, Task

# Create objects.

appman = AppManager(
    hostname=os.environ["RMQ_HOSTNAME"],
    port=int(os.environ["RMQ_PORT"]), # THIS LINE CHANGED
    username=os.environ["RMQ_USERNAME"],
    password=os.environ["RMQ_PASSWORD"])

p = Pipeline()

s = Stage()

t = Task()

# Use the objects to model the workflow.

appman.resource_desc = {
    'resource': 'local.localhost',
    'walltime': 10,
    'cpus': 1,
    'project': 'stf011' # THIS LINE CHANGED
}

t.name = 'my-first-task'
t.executable = '/bin/echo'
t.arguments = ['Hello world!']

s.add_tasks(t)

p.add_stages(s)

appman.workflow = set([p])

# Execute the workflow.

appman.run()

Then, I edited ./lib/python2.7/site-packages/radical/pilot/configs/resource_ornl.json by changing all occurrences of $MEMBERWORK/bip178 to $MEMBERWORK/stf011.
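
The same substitution can be scripted instead of edited by hand. This is a minimal sketch that works on the file's text; `replace_project` is a hypothetical helper, and it does a plain string replacement, so a project name that is a prefix of another one would also match.

```python
def replace_project(text, old, new, prefix="$MEMBERWORK/"):
    """Replace every `prefix + old` with `prefix + new` in text.

    Returns the new text and the number of replacements made.
    """
    needle = prefix + old
    count = text.count(needle)
    return text.replace(needle, prefix + new), count
```

To use it, read resource_ornl.json into a string, call `replace_project(text, "bip178", "stf011")`, and write the result back, ideally after keeping a copy of the original file. Note that edits under site-packages are lost when the package is reinstalled.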

I ran the script from the login node again:

$ python2 demo.py 
EnTK session: re.session.login3.seanwilk.018204.0005
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
                                                                              ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [local.localhost:1]
                                                                              ok
wait for 1 pilot(s)
                                                                              ok
closing session re.session.login3.seanwilk.018204.0005                         \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 12.2s                                                       ok
All components terminated
Traceback (most recent call last):
  File "demo.py", line 44, in <module>
    appman.run()
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 384, in run
    'ended in state %s' % res_alloc_state)
radical.entk.exceptions.EnTKError: Cannot proceed. Resource ended in state FAILED

When I check radical.log, I find the following line:

2019-11-04 14:23:44,372: radical.saga        : pmgr.0000.launching.0           : MainThread     : ERROR   : BadParameter: 'JobDescription.Project' (stf011) not supported by radical.saga.adaptors.shell_job

But when I changed stf011 to STF011 in the demo script, I still got exactly the same errors, so I'm not sure where that is coming from or how it is becoming lowercase. I don't know if it's stuck in a cached file somewhere. I have included radical.log here.

radical.log

wilkinson commented 4 years ago

I found the ~/.radical/ directory and removed it. That made things a lot worse right now. It looks like I have a session "active" that I can't figure out how to remove. I'll probably have to figure out how to clear things on MongoDB, too.

$ python2 demo.py 
EnTK session: re.session.login3.seanwilk.018204.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
                                                                              ok
create pilot managerAll components terminated
Traceback (most recent call last):
  File "demo.py", line 44, in <module>
    appman.run()
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 379, in run
    self._rmgr._submit_resource_request()
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 149, in _submit_resource_request
    self._pmgr    = rp.PilotManager(session=self._session)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 132, in __init__
    self._session._register_pmgr(self)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/pilot/session.py", line 692, in _register_pmgr
    self._dbs.insert_pmgr(pmgr.as_dict())
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/pilot/db/database.py", line 186, in insert_pmgr
    result = self._c.insert(pmgr_doc)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/collection.py", line 3182, in insert
    check_keys, manipulate, write_concern)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/collection.py", line 612, in _insert
    bypass_doc_val, session)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/collection.py", line 600, in _insert_one
    acknowledged, _insert_command, session)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1492, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1385, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/collection.py", line 597, in _insert_command
    _check_write_command_response(result)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/helpers.py", line 221, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/helpers.py", line 202, in _raise_last_write_error
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: test.re.session.login3.seanwilk.018204.0001 index: _id_ dup key: { _id: "pmgr.0000" }
^CException KeyboardInterrupt: KeyboardInterrupt() in <module 'threading' from '/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/python-2.7.15-jhok7d6zokcxqb27ze7bv2pnn2b4qvbp/lib/python2.7/threading.pyc'> ignored
closing session re.session.login3.seanwilk.018204.0001     \
session lifetime: 2408.6s                                                     ok

radical.log

wilkinson commented 4 years ago

Okay, I made a mistake when repeating the directions and accidentally used local.localhost instead of ornl.summit. After removing the ${HOME}/radical.pilot.sandbox/ directory and launching the script fresh, I see what looks to be a correctly executing script. I'm still waiting for it to finish, but Update: pipeline.0000.stage.0000 state: DONE looks very promising.
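
A mix-up like this one can be caught before `appman.run()` with a small sanity check on the resource description. The sketch below is not part of EnTK: `check_resource_desc` is a hypothetical helper, the list of expected keys is just what every script in this ticket sets, and the local.localhost rule is an assumption drawn from the BadParameter log line earlier in this thread.

```python
def check_resource_desc(desc):
    """Return a list of human-readable problems found in a resource description."""
    problems = []
    # Keys set by every script in this ticket; treated as required by assumption.
    for key in ("resource", "walltime", "cpus"):
        if key not in desc:
            problems.append("missing key: %s" % key)
    # Assumption from the log in this ticket: the shell adaptor used for
    # local.localhost rejected JobDescription.Project.
    if desc.get("resource") == "local.localhost" and desc.get("project"):
        problems.append("'project' is set but resource is local.localhost")
    return problems
```

An empty list means nothing suspicious was found; otherwise the messages can be printed before any pilot is submitted.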

wilkinson commented 4 years ago

Update: I don't think things finished yet. The output from bjobs tells me:

$ bjobs
No unfinished job found

But, EnTK has never given control back to my command prompt. Here is the live output:

python2 demo.py 
EnTK session: re.session.login2.seanwilk.018204.0004
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
                                                                              ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [ornl.summit:1]
                                                                              ok
All components created
Update: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000.my-first-task state: SCHEDULING
Update: pipeline.0000.stage.0000.my-first-task state: SCHEDULED
Update: pipeline.0000.stage.0000 state: SCHEDULED
create unit manager/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pymongo/topology.py:155: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
                                                           ok
add 1 pilot(s)                                                                ok
Update: pipeline.0000.stage.0000.my-first-task state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: pipeline.0000.stage.0000.my-first-task state: EXECUTED
Update: pipeline.0000.stage.0000.my-first-task state: DONE
Update: pipeline.0000.stage.0000 state: DONE

It hasn't finished yet, and it has been a couple of hours. I terminate with Ctrl-C:

^CProcess task-manager:
Traceback (most recent call last):
  File "/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/python-2.7.15-jhok7d6zokcxqb27ze7bv2pnn2b4qvbp/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/python-2.7.15-jhok7d6zokcxqb27ze7bv2pnn2b4qvbp/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 168, in _tmgr
    mq_channel.basic_get(queue=pending_queue[0])
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_get
    self._basic_getempty_result.is_ready)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 458, in _flush_output
    self._impl.ioloop.poll()
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 495, in poll
    self._poller.poll()
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 1102, in poll
    events = self._poll.poll(self._get_max_wait())
KeyboardInterrupt
wait for 1 pilot(s)
                                                                              ok
closing session re.session.login2.seanwilk.018204.0004                         \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 7134.4s                                                     ok
All components terminated
Traceback (most recent call last):
  File "demo.py", line 44, in <module>
    appman.run()
  File "/ccs/home/seanwilk/entk-with-passwords/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 404, in run
    raise KeyboardInterrupt
KeyboardInterrupt

I'm not sure what hung, but this line from radical.log seems interesting:

2019-11-04 15:28:28,081: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : INFO    : update Job <radical.saga.adaptors.lsf.lsfjob.LSFJob object at 0x7fffb43e7910> (state: Failed)

I have attached the full radical.log below.

radical.log

mturilli commented 4 years ago

Thanks @wilkinson , I am sorry this is still broken. I am going to try to replicate your issue. Can I have the output of your radical-stack please? Also, are you using rabbitmq and mongodb within ORNL?

wilkinson commented 4 years ago

Hi @mturilli,

The stack I am using is my own fork of EnTK that corresponds to my pull request (#379). I am using that in order to use RabbitMQ and MongoDB services running on Slate at OLCF.

$ radical-stack 

  python               : 2.7.15
  pythonpath           : /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/py-virtualenv-16.0.0-ohsvxc5mf4aornhyrfp4ecea5bzcowon/lib/python2.7/site-packages:/sw/summit/xalt/1.1.4/site:/sw/summit/xalt/1.1.4/libexec
  virtualenv           : /ccs/home/seanwilk/entk-with-passwords

  radical.entk         : 0.72.0-v0.70.0-60-gc73e031@devel
  radical.pilot        : 0.73.1
  radical.saga         : 0.72.1
  radical.utils        : 0.72.0

I tried launching the demo again just now to try to reproduce the line of interest (with update Job <radical.saga.adaptors.lsf.lsfjob.LSFJob object at 0x7fffb43e7910> (state: Failed) in it) from my previous post, but it did not appear in the output this time. Maybe I killed it too soon this time, because apparently I only waited 153 seconds before killing with Ctrl-C.

mturilli commented 4 years ago

@wilkinson I was not able to replicate the error you reported. My script is the same as yours, barring the RMQ and MDB endpoints and the allocation configuration. I use a slightly updated stack compared to yours:

$ radical-stack 

  python               : 2.7.15
  virtualenv           : /autofs/nccs-svm1_home1/mturilli1/test/ve/entk_384

  radical.entk         : 0.72.1
  radical.pilot        : 0.73.1
  radical.saga         : 0.72.1
  radical.utils        : 0.72.0

You may want to try updating your stack. As you can see, and as previously reported, I successfully executed your script in 144.4s:

$ python entk_384.py 
EnTK session: re.session.login2.mturilli1.018225.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
                                                                              ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [ornl.summit:1]
                                                                              ok
Update: All components created
pipeline.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000.my-first-task state: SCHEDULING
Update: pipeline.0000.stage.0000.my-first-task state: SCHEDULED
Update: pipeline.0000.stage.0000 state: SCHEDULED
create unit manager/autofs/nccs-svm1_home1/mturilli1/test/ve/entk_384/lib/python2.7/site-packages/pymongo/topology.py:155: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
                                                           ok
add 1 pilot(s)                                                                ok
Update: pipeline.0000.stage.0000.my-first-task state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: pipeline.0000.stage.0000.my-first-task state: EXECUTED
Update: pipeline.0000.stage.0000.my-first-task state: DONE
Update: pipeline.0000.stage.0000 state: DONE
Update: pipeline.0000 state: DONE
wait for 1 pilot(s)
                                                                              ok
closing session re.session.login2.mturilli1.018225.0001                        \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 144.4s                                                      ok
All components terminated

Here is the output of your task:

$ cat /gpfs/alpine/scratch/mturilli1/bip179/radical.pilot.sandbox/re.session.login2.mturilli1.018225.0001/pilot.0000/unit.000000/STDOUT 
Hello world!

I am going to assume that there are some differences in the way in which you set up your environment, or some issues with your RMQ/MDB setup/conf. You are welcome to try out my script+config and see whether it also works for you. This would confirm that the issue is indeed with your RMQ/MDB setup/conf. Following this confirmation, if you add a unit test to your PR #379, we will be able to merge it and then try to use your RMQ/MDB setup/conf ourselves for debugging.

Please let me know how you want to proceed.

mturilli commented 4 years ago

@wilkinson here is the script I used. I sent you my setup via email, as it contains sensitive strings.

#- Python 2.7 source code

#- demo.py ~~
#

import os
from radical.entk import AppManager, Pipeline, Stage, Task

# Create objects.

appman = AppManager(
    hostname=os.environ["RMQ_HOSTNAME"],
    port=int(os.environ["RMQ_PORT"]),
#    username=os.environ["RMQ_USERNAME"],
#    password=os.environ["RMQ_PASSWORD"]
)

p = Pipeline()

s = Stage()

t = Task()

# Use the objects to model the workflow.

appman.resource_desc = {
    'resource': 'ornl.summit',
    'walltime': 10,
    'cpus': 1,
    'project': 'bip179'
}

t.name = 'my-first-task'
t.executable = '/bin/echo'
t.arguments = ['Hello world!']

s.add_tasks(t)

p.add_stages(s)

appman.workflow = set([p])

# Execute the workflow.

appman.run()

#- vim:set syntax=python: