Executing Function Calls with RP #2404

Closed darkwhite29 closed 3 years ago

darkwhite29 commented 3 years ago
$ python -V
Python 3.6.10 :: Anaconda, Inc.

$ radical-stack
  python               : /ccs/home/litan/miniconda3/envs/mocu/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.6.10
  virtualenv           : mocu

  radical.entk         : 1.6.5
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.6-v1.6.6-91-g3de34a4@feature-funcs_v2
  radical.saga         : 1.6.6
  radical.utils        : 1.6.6

$ export RADICAL_PROFILE=TRUE
$ export RADICAL_PILOT_PROFILE=TRUE
$ export RADICAL_ENTK_PROFILE=TRUE
$ export RADICAL_VERBOSE=DEBUG
$ export RADICAL_LOG_LVL=DEBUG
$ export RADICAL_LOG_TGT=radical.log

The brief background: a function named computeExpectedRemainingMOCU(), which contains the primary CUDA code, needs to run on a GPU.
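
For context, the client side of the workflow follows the usual RP pattern; here is a minimal sketch, assuming the Pilot/Task API names of this RP version. The resource and core/GPU counts match the console output below; the project, runtime, and task count are placeholders, and the function-task fields of the feature-funcs_v2 branch are omitted since that interface is not shown here:

import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # pilot sized as in the console output below (672 cores, 24 GPUs);
    # 'project' and 'runtime' are placeholders
    pd = rp.PilotDescription({'resource': 'ornl.summit',
                              'project' : 'CSC299',
                              'cores'   : 672,
                              'gpus'    : 24,
                              'runtime' : 60})
    pilot = pmgr.submit_pilots(pd)
    tmgr.add_pilots(pilot)

    # one task description per computeExpectedRemainingMOCU() invocation;
    # the function-specific fields of the funcs_v2 branch are omitted here
    n_tasks = 64                                      # placeholder count
    tds     = [rp.TaskDescription() for _ in range(n_tasks)]
    tasks   = tmgr.submit_tasks(tds)
    tmgr.wait_tasks()                                 # this is where the runs below hang
finally:
    session.close(download=True)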

Now I have an error when running the workflow:

(mocu) [litan@login4.summit N7]$ python runMainForPerformanceMeasure.py
             Unstable system has been found
Round:  0 / 1 - iODE Iteration:  0  Initial MOCU:  1.5268638549428997  Computation time:  2.120206356048584
iterative:  True

================================================================================
 An HPC Workflow for MOCU on GPU
================================================================================

new session: [rp.session.login4.litan.018799.0027]                             \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok

--------------------------------------------------------------------------------
submit pilots

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit             672 cores      24 gpus           ok

--------------------------------------------------------------------------------
submit tasks

create task manager                                                           ok
create: ########################################################################
submit: ########################################################################

--------------------------------------------------------------------------------
gather results

wait  : Traceback (most recent call last):
  File "runMainForPerformanceMeasure.py", line 97, in <module>
    MOCUCurve, experimentSequence, timeComplexity = findMOCUSequence(criticalK, isSynchronized, MOCUInitial, K_max, w, N, deltaT, MVirtual, MReal, TVirtual, TReal, aLowerUpdated, aUpperUpdated, it_idx, update_cnt, iterative = iterative)
  File "/gpfs/alpine/csc299/scratch/litan/MOCU/Byung-Jun/new/ExaLearn-ODED-Kuramoto-main/N7/findMOCUSequence.py", line 145, in findMOCUSequence
    tmgr.wait_tasks()
  File "/ccs/home/litan/miniconda3/envs/mocu/lib/python3.6/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
    time.sleep (0.1)
KeyboardInterrupt

Code and log files for the client and the agent are attached in the zip file below for your information: function_call_issue.zip

AymenFJA commented 3 years ago

Hey @darkwhite29. Looking at bootstrap_0.out, it seems that your agent-side Python environment was not created properly for some reason: one of the required serialization packages, dill, is missing.

bootstrap_0.out :

purge install source at radical.pilot-1.6.6/
1624331006.0000,rp_install_stop,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
1624331006.0000,ve_setup_stop,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
verify python viability: /gpfs/alpine/csc299/scratch/litan/radical.pilot.sandbox/ve.ornl.summit.1.6.6/bin/python ... ok
verify module viability: radical.pilot   ...Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/alpine/scratch/litan/csc299/radical.pilot.sandbox/rp.session.login4.litan.018799.0027/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/__init__.py", line 43, in <module>
    from . import agent
  File "/gpfs/alpine/scratch/litan/csc299/radical.pilot.sandbox/rp.session.login4.litan.018799.0027/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/agent/__init__.py", line 7, in <module>
    from .mpi_worker       import MPI_Func_Worker
  File "/gpfs/alpine/scratch/litan/csc299/radical.pilot.sandbox/rp.session.login4.litan.018799.0027/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/agent/mpi_worker.py", line 7, in <module>
    import dill
ModuleNotFoundError: No module named 'dill'
 failed
python installation cannot load module radical.pilot - abort

What you can do is delete the existing RP environment on the agent side, let RP do a fresh install, and see if that works. If that does not work, you can create your own venv, install RADICAL-Pilot into it, and let RP use that specific env by specifying the path to that env in the resource_ornl.json file and setting "virtenv_mode": "use".
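
For example, the stale pilot-side environment visible in the bootstrap_0.out excerpt above could be removed with:

$ rm -rf /gpfs/alpine/csc299/scratch/litan/radical.pilot.sandbox/ve.ornl.summit.1.6.6

so that the next pilot submission bootstraps a fresh one.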

darkwhite29 commented 3 years ago

Thank you for digging this out. I set up a new Python environment and did a fresh install of RP. However, the same error occurred. I checked that the dill package is installed:

(mocu2) [litan@login5.summit N7]$ pip install dill
Requirement already satisfied: dill in /autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages (0.3.4)

I am not sure whether the error really is related to an incorrectly installed RCT stack, but it is possible that the installed dill package is somehow not picked up by RCT. Below is my stack:

(mocu2) [litan@login5.summit N7]$ radical-stack

  python               : /ccs/home/litan/miniconda3/envs/mocu2/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.9
  virtualenv           : mocu2

  radical.entk         : 1.6.5
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.6-v1.6.6-91-g3de34a4@feature-funcs_v2
  radical.saga         : 1.6.6
  radical.utils        : 1.6.6

Any thoughts at this point?

andre-merzky commented 3 years ago

Did you recreate the venv used by the script you are running, or the one used by the pilot in radical.pilot.sandbox? mocu2 sounds like the former, but @AymenFJA referred to the latter, which is the one that triggers the error in the executor.
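
To be explicit, there are two separate environments in play here (paths from the logs above):

  client side: /ccs/home/litan/miniconda3/envs/mocu2 (your conda env, used by the script)
  pilot side : /gpfs/alpine/csc299/scratch/litan/radical.pilot.sandbox/ve.ornl.summit.1.6.6 (created by the bootstrapper on the agent side)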

darkwhite29 commented 3 years ago

@andre-merzky Yes indeed, I recreated a new conda virtual environment and reinstalled RP there, but that did not work. Now I am trying the second approach @AymenFJA suggested.

Just to confirm that I am doing it correctly: I have an old mocu environment and a new mocu-pilot environment (a fresh installation of RP is done in mocu-pilot, and I saw that dill is successfully installed).

...
Successfully built radical.pilot
Installing collected packages: urllib3, idna, chardet, setproctitle, requests, regex, radical.gtod, pyzmq, pymongo, netifaces, msgpack, future, colorama, radical.utils, parse, apache-libcloud, whichcraft, radical.saga, python-hostlist, ntplib, mpi4py, dill, radical.pilot
Successfully installed apache-libcloud-3.3.1 chardet-4.0.0 colorama-0.4.4 dill-0.3.4 future-0.18.2 idna-2.10 mpi4py-3.0.3 msgpack-1.0.2 netifaces-0.11.0 ntplib-0.4.0 parse-1.19.0 pymongo-3.11.4 python-hostlist-1.21 pyzmq-22.1.0 radical.gtod-1.5.0 radical.pilot-1.6.6 radical.saga-1.6.6 radical.utils-1.6.6 regex-2021.4.4 requests-2.25.1 setproctitle-1.2.2 urllib3-1.26.5 whichcraft-0.6.1

Now, in the resource_ornl.json used by mocu, I need to point to the RP installed in mocu-pilot. Below is the content of that resource_ornl.json:

"summit": {
        "description"                 : "ORNL's summit, a Cray XK7",
        "notes"                       : null,
        "schemas"                     : ["local"],
        "local"                       : {
            "job_manager_hop"         : "fork://localhost/",
            "job_manager_endpoint"    : "lsf://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },
        "default_queue"               : "batch",
        "resource_manager"            : "LSF_SUMMIT",
        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "agent_launch_method"         : "JSRUN",
        "task_launch_method"          : "JSRUN",
        "mpi_launch_method"           : "JSRUN",
        "pre_bootstrap_0"             : ["module unload xl",
                                         "module unload xalt",
                                         "module unload spectrum-mpi",
                                         "module unload py-pip",
                                         "module unload py-virtualenv",
                                         "module unload py-setuptools",
                                         "module load   gcc/8.1.1",
                                         "module load   zeromq/4.2.5",
                                         "module load   python/3.7.0",
                                         "module list"],
        "pre_bootstrap_1"             : ["module unload xl",
                                         "module unload xalt",
                                         "module unload spectrum-mpi",
                                         "module unload py-pip",
                                         "module unload py-virtualenv",
                                         "module unload py-setuptools",
                                         "module load   gcc/8.1.1",
                                         "module load   zeromq/4.2.5",
                                         "module load   python/3.7.0",
                                         # increase process limit on node
                                         "ulimit -u 65536"],
        "valid_roots"                 : ["$MEMBERWORK/"],
        "default_remote_workdir"      : "$MEMBERWORK/%(pd.project)s",
        "rp_version"                  : "local",
        "virtenv_mode"                : "use",
        "stage_cacerts"               : true,
        "python_dist"                 : "default",
        "virtenv_dist"                : "default",
        "gpus_per_node"               : 6,
        "sockets_per_node"            : 2,
        "lfs_per_node"                : "/tmp",
        "system_architecture"         : {"smt": 4,
                                         "options": ["gpumps", "nvme"]}
    },

@AymenFJA says to "specify the path to that env in the resource_ornl.json file". Where in this file do I specify the path to mocu-pilot? Is it rp_version? Thanks!

AymenFJA commented 3 years ago

Hey @darkwhite29, to make RP use a preinstalled env, all you need to do is the following:

        "rp_version"                  : "installed",
        "virtenv"                     : "/path/to/your/venv/mocu-pilot/",
        "virtenv_mode"                : "use",

FYI: if you are using the Anaconda Python distribution, then please change python_dist to anaconda.

Also, I just noticed that, since I did not test these executors on Summit, I never created the appropriate config file for that machine. As a temporary solution, please change the following in your resource_ornl file:

  1. "agent_scheduler" : "NOOP",
  2. "agent_spawner" : "FUNCS",
  3. "task_launch_method" : "FUNCS",

Let us know if that works.

darkwhite29 commented 3 years ago

Thanks @AymenFJA for the solution. I use miniconda rather than anaconda, but I set python_dist to anaconda per your advice anyway.

It is getting better now: the dill package error is gone, at least, but I still get the same error on the command line (and all the task.0000xx folders at the client are empty, as before):

(mocu2) [litan@login4.summit N7]$ python runMainForPerformanceMeasure.py
             Unstable system has been found
Round:  0 / 1 - iODE Iteration:  0  Initial MOCU:  1.520765083606797  Computation time:  2.1073079109191895
iterative:  True

================================================================================
 An HPC Workflow for MOCU on GPU
================================================================================

new session: [rp.session.login4.litan.018801.0004]                             \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok

--------------------------------------------------------------------------------
submit pilots

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit             672 cores      24 gpus           ok

--------------------------------------------------------------------------------
submit tasks

create task manager                                                           ok
create: ########################################################################
submit: ########################################################################

--------------------------------------------------------------------------------
gather results

wait  : Traceback (most recent call last):
  File "runMainForPerformanceMeasure.py", line 97, in <module>
    MOCUCurve, experimentSequence, timeComplexity = findMOCUSequence(criticalK, isSynchronized, MOCUInitial, K_max, w, N, deltaT, MVirtual, MReal, TVirtual, TReal, aLowerUpdated, aUpperUpdated, it_idx, update_cnt, iterative = iterative)
  File "/gpfs/alpine/csc299/scratch/litan/MOCU/Byung-Jun/new/ExaLearn-ODED-Kuramoto-main/N7/findMOCUSequence.py", line 145, in findMOCUSequence
    tmgr.wait_tasks()
  File "/ccs/home/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
    time.sleep (0.1)
KeyboardInterrupt

More files have been generated at the client this time, though. Below is the entire session sandbox:

rp.session.login4.litan.018801.0004.zip

Checking bootstrap_0.out, I noticed an error I am not familiar with. Could you please take a look? Thanks!

darkwhite29 commented 3 years ago

I confirm that running the example code here produces the same error as my workflow:

(mocu2) [litan@login3.summit test]$ python 12_task_function_1.py ornl.summit

================================================================================
 Getting Started (RP version 1.6.6)
================================================================================

new session: [rp.session.login3.litan.018801.0007]                             \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit             168 cores       6 gpus           ok

--------------------------------------------------------------------------------
submit tasks

create task manager                                                           ok
create: ########################################################################
submit: ########################################################################

--------------------------------------------------------------------------------
gather results

wait  : --------------
RADICAL Utils -- Stacktrace [57244] [MainThread]

litan     57244  53655  7 14:29 pts/9    00:00:48  |           \_ python 12_task_function_1.py ornl.summit
litan     57590  57244  0 14:29 pts/66   00:00:00  |               \_ /bin/bash -i
litan     57600  57244  0 14:29 pts/67   00:00:00  |               \_ /bin/sh -i
Traceback (most recent call last):
  File "12_task_function_1.py", line 116, in <module>
    tmgr.wait_tasks()
  File "/ccs/home/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
    time.sleep (0.1)
KeyboardInterrupt

--------------
exit requested

--------------------------------------------------------------------------------
finalize

closing session rp.session.login3.litan.018801.0007                    \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ rp.session.login3.litan.018801.0007 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 673.3s                                                      ok

--------------------------------------------------------------------------------
AymenFJA commented 3 years ago

Hey @darkwhite29 , checking the radical.log, I see:

  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/utils/component.py", line 480, in _worker_thread
    self._initialize()
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/utils/component.py", line 623, in _initialize
    self.initialize()
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/agent/executing/funcs.py", line 110, in initialize
    self._spawn(self._launcher, funcs)
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/agent/executing/funcs.py", line 145, in _spawn
    launch_cmd, hop_cmd = launcher.construct_command(funcs, fname)
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/agent/launch_method/jsrun.py", line 138, in construct_command
    task_sandbox = t['task_sandbox_path']
KeyError: 'task_sandbox_path'

It shows that the agent_launch_method is JSRUN, which is not compatible with the current agent. Can you change it to "agent_launch_method" : "SSH" and give it a try, please?

darkwhite29 commented 3 years ago

Unfortunately still the same error:

(mocu2) [litan@login5.summit test]$ python 12_task_function_1.py ornl.summit

================================================================================
 Getting Started (RP version 1.6.6)
================================================================================

new session: [rp.session.login5.litan.018802.0001]                             \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit             168 cores       6 gpus           ok

--------------------------------------------------------------------------------
submit tasks

create task manager                                                           ok
create: ########################################################################
submit: ########################################################################

--------------------------------------------------------------------------------
gather results

wait  : --------------
RADICAL Utils -- Stacktrace [872] [MainThread]

litan       872  60887  6 11:12 pts/38   00:04:17  |           \_ python 12_task_function_1.py ornl.summit
litan      1361    872  0 11:12 pts/48   00:00:00  |               \_ /bin/bash -i
litan      1378    872  0 11:12 pts/49   00:00:00  |               \_ /bin/sh -i
Traceback (most recent call last):
  File "12_task_function_1.py", line 116, in <module>
    tmgr.wait_tasks()
  File "/ccs/home/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
    time.sleep (0.1)
KeyboardInterrupt

--------------
exit requested

--------------------------------------------------------------------------------
finalize

closing session rp.session.login5.litan.018802.0001                    \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ rp.session.login5.litan.018802.0001 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 3692.6s                                                     ok

--------------------------------------------------------------------------------

The entire session sandbox is attached below:

rp.session.login5.litan.018802.0001.zip

I noticed that in our submitted papers RAPTOR has been tested on Summit before. @andre-merzky, do you have a tested resource_ornl.json for Summit? Thank you so much.

AymenFJA commented 3 years ago

Hi @darkwhite29 ,

From the log files I can see that the executor is up and running (you can see that under the func_exec.0000 folder), but from func_exec.0000.err I noticed that there is a file system permission issue:

Traceback (most recent call last):
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/bin/radical-pilot-agent-funcs", line 316, in <module>
    executor = Executor()
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/bin/radical-pilot-agent-funcs", line 66, in __init__
    self._initialize()
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/bin/radical-pilot-agent-funcs", line 96, in _initialize
    self._zmq_req  = ru.zmq.Getter(channel='funcs_req_queue', url=addr_req)
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/utils/zmq/queue.py", line 536, in __init__
    self._log   = Logger(name=self._uid, ns='radical.utils')
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/utils/logger.py", line 286, in __init__
    elif t in ['.']               : h = FSHandler("%s/%s.log" % (p, n))
  File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/utils/logger.py", line 174, in __init__
    logging.FileHandler.__init__(self, target)
  File "/ccs/home/litan/miniconda3/envs/mocu-pilot/lib/python3.7/logging/__init__.py", line 1087, in __init__
    StreamHandler.__init__(self, self._open())
  File "/ccs/home/litan/miniconda3/envs/mocu-pilot/lib/python3.7/logging/__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
OSError: [Errno 30] Read-only file system: '/autofs/nccs-svm1_home1/litan/funcs_req_queue.get.0000.log'

I do not know how to proceed from here, @andre-merzky any suggestions, please?

darkwhite29 commented 3 years ago

Thanks @AymenFJA for the diagnosis. I confirm that I was running from the GPFS file system on Summit, which allows both reading and writing during runs:

[litan@login5.summit N7]$ pwd
/gpfs/alpine/csc299/scratch/litan/MOCU/Byung-Jun/new/ExaLearn-ODED-Kuramoto-main/N7

However, my virtual environment is in the /home directory, which is read-only from the compute nodes. All my previously used virtual environments are in the /home directory, and all previous RCT runs were free of this issue. Maybe RAPTOR somehow writes into the virtual environment's directory at runtime?
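
For what it is worth, the read-only restriction is easy to see from inside a batch job; a hypothetical check would fail like this:

$ jsrun -n 1 touch $HOME/write_test
touch: cannot touch '/ccs/home/litan/write_test': Read-only file system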

darkwhite29 commented 3 years ago

I reinstalled conda in GPFS and reran the workflow, still getting the same error from func_exec.0000.err:

Traceback (most recent call last):
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/bin/radical-pilot-agent-funcs", line 316, in <module>
    executor = Executor()
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/bin/radical-pilot-agent-funcs", line 66, in __init__
    self._initialize()
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/bin/radical-pilot-agent-funcs", line 96, in _initialize
    self._zmq_req  = ru.zmq.Getter(channel='funcs_req_queue', url=addr_req)
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/site-packages/radical/utils/zmq/queue.py", line 536, in __init__
    self._log   = Logger(name=self._uid, ns='radical.utils')
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/site-packages/radical/utils/logger.py", line 286, in __init__
    elif t in ['.']               : h = FSHandler("%s/%s.log" % (p, n))
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/site-packages/radical/utils/logger.py", line 174, in __init__
    logging.FileHandler.__init__(self, target)
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/logging/__init__.py", line 1087, in __init__
    StreamHandler.__init__(self, self._open())
  File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/logging/__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
OSError: [Errno 30] Read-only file system: '/autofs/nccs-svm1_home1/litan/funcs_req_queue.get.0000.log'

The function executor's log is somehow still being written to the /home directory. Any ideas at this point?

andre-merzky commented 3 years ago

@AymenFJA: it looks like the func executor gets $HOME as its sandbox. I would suggest adding this to the executor startup script in executing/funcs.py, around line 168:


fout.write('cd %s\n' % sandbox)   # run the executor in its sandbox, not in $HOME
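
For reference, in context the spawner would then write a startup script along these lines (a sketch only: variable names are taken from the traceback above, while the script name and surrounding code are assumed):

# sketch of _spawn() in radical/pilot/agent/executing/funcs.py
fname = '%s/funcs.sh' % sandbox              # hypothetical script name
with open(fname, 'w') as fout:
    fout.write('#!/bin/sh\n')
    fout.write('cd %s\n' % sandbox)          # the fix: start the executor in
                                             # its sandbox so its logs land on
                                             # GPFS, not in the read-only $HOME
    fout.write('%s\n' % launch_cmd)          # launch_cmd from construct_command()
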
darkwhite29 commented 3 years ago

I confirm that the fix @andre-merzky provided works! Thanks @andre-merzky! Have a nice weekend.

darkwhite29 commented 3 years ago

A quick question @andre-merzky: for using RAPTOR, do I need to make any changes to resource_ornl.json as I did here, or can I just use the default one? Thanks!

andre-merzky commented 3 years ago

Hey @darkwhite29 - you should be able to use the released version of RP, and you don't need any changes to the configs.
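
For anyone landing here later: with a released RP, raptor-based function execution looks roughly like the sketch below. This follows recent RP documentation, not the feature-funcs_v2 branch used earlier in this thread; my_func and its argument are placeholders, and the session/pilot/tmgr setup is as in the earlier sketch:

import radical.pilot as rp

@rp.pythontask
def my_func(x):                              # placeholder function
    return x * 2

# start a raptor master and one worker inside the pilot allocation
raptor = pilot.submit_raptors([rp.TaskDescription(
             {'mode': rp.RAPTOR_MASTER})])[0]
raptor.submit_workers([rp.TaskDescription(
             {'mode': rp.RAPTOR_WORKER})])

# submit the function call as a task and wait for its completion
td   = rp.TaskDescription({'mode'    : rp.TASK_FUNCTION,
                           'function': my_func(21)})
task = raptor.submit_tasks([td])[0]
tmgr.wait_tasks([task.uid])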

darkwhite29 commented 3 years ago

Thanks a lot @andre-merzky for the follow-up, and thanks a lot @AymenFJA for the persistent help. I think it is now good to close this ticket.