Closed AymenFJA closed 1 year ago
radical-pilot-raptor-worker not being found points to an environment or deployment problem. This seems to use /tmp/ve3/bin/python3 - is that directory available on the compute nodes?
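One quick way to answer that question is to run a small check on a compute node itself (for example through the same launcher the pilot uses). The sketch below is only an illustration: the helper name `interpreter_available` is hypothetical, and the path is the one quoted in the comment above.

```python
import os
import sys


def interpreter_available(path):
    """Return True if `path` is an existing file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Path quoted in the comment above; /tmp is usually node-local, so the
    # result on a login node says nothing about the compute nodes.
    candidate = "/tmp/ve3/bin/python3"
    print(candidate, "available:", interpreter_available(candidate))
    # For comparison: the interpreter running this script.
    print(sys.executable, "available:", interpreter_available(sys.executable))
```

If the check fails on the compute nodes but passes on the login node, the virtualenv lives on a node-local file system and has to be moved to (or recreated on) a shared one.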
After fixing the path in pilot.prepare_env for the RAPTOR example, the run went through, but the worker hit MPI issues.
In master.000000.worker.0000.err
Launch commands
/opt/ibm/jsm/bin/jsrun -n1 -a1 -c1 -g0 -b rs $RP_TASK_SANDBOX/master.000000.exec.sh
/opt/ibm/jsm/bin/jsrun -n40 -a1 -c1 -g0 -b rs $RP_TASK_SANDBOX/master.000000.worker.0000.exec.sh
Reporter info
Pilot sandbox - rp.session.login5.matitov.019546.0000.tar.gz
Could it be that some modules are not loaded? The raptor env is based on bs0_pre.env, and at that point the pre_exec part of the resource config has not been applied yet. The prepare_env call should probably use the same pre_exec sequence.
Env setup remains a pain :-( It might be easier to manually prepare an env to be used by the pilot and raptor (basically the client env + mpi4py).
The issue was in the LD_LIBRARY_PATH env variable: we reset it within ...worker.0000.exec.sh after that script has been started by the corresponding launch method (jsrun and mpirun). After adding the following for a worker task, the basic RAPTOR example worked fine:
"pre_exec" : ["export LD_LIBRARY_PATH=/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/container/../lib/pami_port:${LD_LIBRARY_PATH}"]
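The same workaround can be generated rather than hard-coded. This is a minimal sketch, assuming the worker description is handled as a plain dict with a "pre_exec" list as in the snippet above; the helper name `prepend_ld_library_path` and the placeholder directory are hypothetical and must be replaced with the site-specific pami_port path.

```python
def prepend_ld_library_path(lib_dir):
    """Build a shell command that prepends `lib_dir` to LD_LIBRARY_PATH.

    ${LD_LIBRARY_PATH} is left unexpanded on purpose: it must be resolved
    by the shell that runs the worker, not by the client-side Python process.
    """
    return "export LD_LIBRARY_PATH=%s:${LD_LIBRARY_PATH}" % lib_dir


# Hypothetical placeholder, not a real Summit path.
PAMI_DIR = "/path/to/spectrum-mpi/lib/pami_port"

# Assumption: only the "pre_exec" key is taken from the thread above.
worker_descr = {"pre_exec": [prepend_ld_library_path(PAMI_DIR)]}

if __name__ == "__main__":
    print(worker_descr["pre_exec"][0])
```

Prepending keeps whatever the launch method already exported, so the entry only adds the missing library directory instead of replacing the environment again.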
p.s.
LD_LIBRARY_PATH='/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib:/sw/summit/gcc/9.1.0-alpha+20190716/lib64:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/lib:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.4.0-kjycgqekqo72q2f4xzrpwcnb6j4kl4ed/lib:/opt/ibm/jsm/lib'
(from our env script /gpfs/alpine/geo111/scratch/matitov/radical.pilot.sandbox/rp.session.login2.matitov.019550.0003/pilot.0000/env/rp_named_env.rp.jsrun.sh)
hello_mpi
example - LD_LIBRARY_PATH='/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/xl-16.1.1-10/spectrum-mpi-10.4.0.3-20210112-dzedzfvocsuzkm4jkqe7o64x53yhq7nm/container/../lib/pami_port:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/xl-16.1.1-10/spectrum-mpi-10.4.0.3-20210112-dzedzfvocsuzkm4jkqe7o64x53yhq7nm/container/../lib:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/xl-16.1.1-10/spectrum-mpi-10.4.0.3-20210112-dzedzfvocsuzkm4jkqe7o64x53yhq7nm/container/../lib/pami_port:pami_port:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/xl-16.1.1-10/spectrum-mpi-10.4.0.3-20210112-dzedzfvocsuzkm4jkqe7o64x53yhq7nm/container/../lib:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/xl-16.1.1-10/spectrum-mpi-10.4.0.3-20210112-dzedzfvocsuzkm4jkqe7o64x53yhq7nm/container/../lib/pami_port:/opt/ibm/jsm/lib/:/opt/ibm/csm/lib/:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.4.0-kjycgqekqo72q2f4xzrpwcnb6j4kl4ed/lib:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/xl-16.1.1-10/spectrum-mpi-10.4.0.3-20210112-dzedzfvocsuzkm4jkqe7o64x53yhq7nm/lib:/sw/summit/xl/16.1.1-10/xlsmp/5.1.1/lib:/sw/summit/xl/16.1.1-10/xlmass/9.1.1/lib:/sw/summit/xl/16.1.1-10/xlC/16.1.1/lib:/sw/summit/xl/16.1.1-10/xlf/16.1.1/lib:/sw/summit/xl/16.1.1-10/lib:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/lib'
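Comparing two values of this length by eye is error-prone. The sketch below lists the entries of one LD_LIBRARY_PATH value that are absent from another; the function name `missing_entries` and the short example paths are hypothetical. Paths are normalized lexically with `os.path.normpath`, so `container/../lib` is treated as `lib` without consulting the file system (symlinks are not resolved).

```python
import os


def missing_entries(reference, candidate):
    """Return entries of `reference` that do not occur in `candidate`.

    Both arguments are colon-separated path lists. Entries are normalized
    lexically, empty entries are skipped, and the result preserves the
    order of `reference` without duplicates.
    """
    present = {os.path.normpath(p) for p in candidate.split(":") if p}
    seen = set()
    missing = []
    for entry in reference.split(":"):
        if not entry:
            continue
        norm = os.path.normpath(entry)
        if norm not in present and norm not in seen:
            seen.add(norm)
            missing.append(norm)
    return missing


if __name__ == "__main__":
    # Shortened, hypothetical stand-ins for the two values quoted above.
    working = "/mpi/container/../lib/pami_port:/mpi/container/../lib:/opt/ibm/jsm/lib/"
    failing = "/mpi/lib:/opt/ibm/jsm/lib"
    print(missing_entries(working, failing))
```

With the real values, the two environments were built against different compilers (gcc-9.1.0 vs. xl-16.1.1-10), so many entries will differ; the pami_port directory is the one that turned out to matter here.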
Found a workaround; this still needs a general-purpose solution.
Related to #2902.
Testing raptor.py on Summit led to the following error:
The used environment: