radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

RAPTOR: raptor.py example fails on Summit. #2980

Closed. AymenFJA closed this issue 1 year ago.

AymenFJA commented 1 year ago

Related to #2902.

Testing raptor.py on Summit led to the following error:

```
$ cat master.000000.worker.0000.err
/gpfs/alpine/scratch/matitov/geo111/radical.pilot.sandbox/rp.session.login2.matitov.019545.0001//pilot.0000//master.000000/master.000000.worker.0000.exec.sh: line 49: radical-pilot-raptor-worker: command not found
```

The environment used:

```
$ tail -n 20 ../env.log
SCRIPT    : /autofs/nccs-svm1_home1/matitov/.conda/envs/ve.rp/bin/radical-pilot-create-static-ve
PREFIX    : /tmp/ve3
VERSION   : 3.8
MODULES   : /ccs/home/matitov/.conda/envs/ve.rp/lib/python3.9/site-packages/radical/pilot/radical.pilot-1.35.0.tar.gz /ccs/home/matitov/.conda/envs/ve.rp/lib/python3.9/site-packages/radical/utils/radical.utils-1.34.0.tar.gz mpi4py apache-libcloud chardet colorama idna msgpack msgpack-python netifaces ntplib parse dill pyzmq regex requests setproctitle urllib3
DEFAULTS  : True
PYTHON    : /tmp/ve3/bin/python3 (Python 3.8.3)
PYTHONPATH: 
RCT_STACK : 
  python               : /tmp/ve3/bin/python3
  pythonpath           : 
  version              : 3.8.3
  virtualenv           : /tmp/ve3

  radical.gtod         : 1.20.1
  radical.pilot        : 1.35.0
  radical.saga         : 1.34.0
  radical.utils        : 1.34.0
```
Env for `worker.exec.sh`:
```
$ cat  /gpfs/alpine/geo111/scratch/matitov/radical.pilot.sandbox/rp.session.login2.matitov.019545.0001/pilot.0000/env/rp_named_env.ve_raptor.jsrun.sh | grep "export PATH="
export PATH='/tmp/ve3/bin:/sw/sources/lsf-tools/2.0/summit/bin:/sw/summit/python/3.8/anaconda3/2020.07-rhel8/bin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/bin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/libzmq-4.3.3-dagzzans4hn3ofpx7ni5u4otte773qzp/bin:/sw/summit/gcc/9.1.0-alpha+20190716/bin:/opt/ibm/csm/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/bin:/sw/summit/python/3.8/anaconda3/2020.07-rhel8/condabin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.4.0-kjycgqekqo72q2f4xzrpwcnb6j4kl4ed/bin:/sw/sources/hpss/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/etc:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibm/jsm/bin:/sw/sources/cgroup_tool/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin'
```

andre-merzky commented 1 year ago

`radical-pilot-raptor-worker` not being found points to an environment or deployment problem. This run seems to use `/tmp/ve3/bin/python3` - is that directory available on the compute nodes? Note that `/tmp` is usually node-local.
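A quick sanity check (a minimal sketch, assuming an active LSF allocation on Summit so `jsrun` can place a process on a compute node):

```python
#!/usr/bin/env python3
# Minimal sketch: check whether the /tmp/ve3 venv created on the login
# node is also visible on a compute node.  /tmp is usually node-local,
# so this is a plausible failure mode for 'command not found'.
import subprocess

ret = subprocess.run(
    ['jsrun', '-n1', '/bin/ls', '-l', '/tmp/ve3/bin/python3'],
    capture_output=True, text=True)

if ret.returncode == 0:
    print('venv python found on compute node:\n' + ret.stdout)
else:
    print('venv python missing on compute node:\n' + ret.stderr)
```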

mtitov commented 1 year ago

After fixing the path in `pilot.prepare_env` for the RAPTOR example, the run went through, but the worker hit MPI issues.
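For reference, a hedged sketch of what the corrected call might look like; `pilot` is the `rp.Pilot` instance from raptor.py, and the `path` value is an illustrative assumption, not the exact change made:

```python
# Sketch only: place the raptor venv on shared GPFS scratch instead of
# node-local /tmp, so the compute nodes can reach it.  The 'path' value
# below is an assumption for illustration.
pilot.prepare_env(env_name='ve_raptor',
                  env_spec={'type'   : 'virtualenv',
                            'version': '3.8',
                            'path'   : '/gpfs/alpine/geo111/scratch/matitov/ve_raptor',
                            'setup'  : ['radical.pilot', 'mpi4py']})
```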

In `master.000000.worker.0000.err`:

```
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      h32n16
Framework: pml
Component: pami
--------------------------------------------------------------------------
[the message above is repeated once per worker rank]
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[the message above is repeated once per worker rank]
[h32n16:1356816] *** An error occurred in MPI_Init_thread
[h32n16:1356816] *** reported by process [3,13]
[h32n16:1356816] *** on a NULL communicator
[h32n16:1356816] *** Unknown error
[h32n16:1356816] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[h32n16:1356816] ***    and potentially your MPI job)
[the same error is reported by processes [3,35], [3,39], [3,1], [3,5],
 [3,7], [3,16], [3,10], [3,2], [3,33] and [3,3] on h32n16]
```

Launch commands

```
/opt/ibm/jsm/bin/jsrun -n1  -a1 -c1 -g0 -b rs $RP_TASK_SANDBOX/master.000000.exec.sh
/opt/ibm/jsm/bin/jsrun -n40 -a1 -c1 -g0 -b rs $RP_TASK_SANDBOX/master.000000.worker.0000.exec.sh
```

Reporter info

```
================================================================================
 Raptor example (RP version 1.35.0)
================================================================================

new session: [rp.session.login5.matitov.019546.0000]
  database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]  ok
create pilot manager                                                         ok
create task manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit_jsrun      120 cores       0 gpus           ok
wait for 1 pilot(s)
              0                                                              ok
Stage files for the worker `my_hello` command.
Call pilot.prepare_env()... done
Submit raptor master(s) ['master.000000']
submit: ########################################################################
wait  : ########################################################################
        AGENT_EXECUTING: 1                                                   ok
Submit non-raptor task(s) ['task.exe.c.000000']
Prepare raptor tasks.
Submit tasks ['task.exe.c.000000', 'task.call.c.000000',
              'task.call_mpi.c.000000', 'task.call.c.3.000000',
              'task.mpi_ser_func.c.000000', 'task.ser_func.c.000000',
              'task.eval.c.000000', 'task.exec.c.000000',
              'task.proc.c.000000', 'task.shell.c.000000'].
submit: ########################################################################
wait  : #######
        AGENT_SCHEDULING: 9
        DONE            : 1
        timeout
id: task.exe.c.000000 [DONE]:
    out: hello 0/2: task.exe.c.000000
         hello 1/2: task.exe.c.000000
    ret: None
id: task.call.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.call_mpi.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.call.c.3.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.mpi_ser_func.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.ser_func.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.eval.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.exec.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.proc.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
id: task.shell.c.000000 [AGENT_SCHEDULING]: out: None  ret: None
closing session rp.session.login5.matitov.019546.0000
  close task manager                                                         ok
  close pilot manager
    wait for 1 pilot(s)
              0                                                              ok
                                                                             ok
  + rp.session.login5.matitov.019546.0000 (json)
  + pilot.0000 (profiles)
  + pilot.0000 (logfiles)
session lifetime: 700.7s                                                     ok

Logs from the master task should now be in local files like
rp.session.login5.matitov.019546.0000/pilot.0000/master.000000.log
```

Pilot sandbox - rp.session.login5.matitov.019546.0000.tar.gz

andre-merzky commented 1 year ago

Could it be that some modules are not loaded? The raptor env is based on `bs0_pre.env`, and at that point the `pre_exec` part of the resource config has not been applied yet. The `prepare_env` call should probably use the same `pre_exec` sequence.

Env setup remains a pain :-( It might be easier to manually prepare an env to be used by the pilot and raptor (basically the client env + mpi4py).
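For illustration, a hedged sketch of how `prepare_env` might carry the same `pre_exec` sequence as the resource config; the module names below are placeholders for whatever `ornl.summit_jsrun` actually loads:

```python
# Sketch only: build the raptor env with the resource config's pre_exec
# applied, so the MPI modules are loaded while the env is created.
# Module names are illustrative placeholders, not Summit's actual config.
pilot.prepare_env(env_name='ve_raptor',
                  env_spec={'type'    : 'virtualenv',
                            'version' : '3.8',
                            'pre_exec': ['module load gcc/9.1.0',
                                         'module load spectrum-mpi'],
                            'setup'   : ['radical.pilot', 'mpi4py']})
```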

mtitov commented 1 year ago

The issue was with the `LD_LIBRARY_PATH` env variable: we reset it within `...worker.0000.exec.sh` after it is started by the corresponding launch method (`jsrun` / `mpirun`). After adding the following `pre_exec` for the worker task, the basic RAPTOR example worked fine: `"pre_exec" : ["export LD_LIBRARY_PATH=/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port:${LD_LIBRARY_PATH}"]`
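Applied to the raptor.py worker submission, the fix might look like the following sketch; the submission call is an assumption (the raptor worker API differs across RP versions), with `raptor` standing in for the master task handle:

```python
# Sketch only: attach the LD_LIBRARY_PATH fix to the raptor worker task.
import radical.pilot as rp

pami_port = ('/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/'
             'spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/'
             'container/../lib/pami_port')

worker_descr = rp.TaskDescription({
    'mode'     : rp.RAPTOR_WORKER,
    'named_env': 've_raptor',
    'pre_exec' : ['export LD_LIBRARY_PATH=%s:${LD_LIBRARY_PATH}' % pami_port]})

raptor.submit_workers([worker_descr])   # assumed submission call
```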


mturilli commented 1 year ago

Found a workaround; a general-purpose solution is still needed.