radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

RADICAL-Pilot RAPTOR fails on ORNL Summit #2673

Closed AymenFJA closed 1 year ago

AymenFJA commented 2 years ago

This ticket is related to our new use case, Cylon; it is blocking us from starting an initial scaling test on Summit.

As a first step, we are trying to run raptor.py (executables and functions, MPI and non-MPI) on Summit, from the login node and interactively; with jsrun it fails in multiple scenarios:

Scenario 1: RP creates a new Conda env for RAPTOR:

If I instruct RP to create the RAPTOR environment with Conda using this setup:

 pilot.prepare_env(env_name='ve_raptor',
                    env_spec={'type'   : 'conda',
                              'version': '3.7',
                              'pre_exec': ["module unload xl xalt ",
                                           "module load   gcc/9.1.0",
                                           "module load   python/3.7-anaconda3"],
                              #'path'   : '/ccs/home/aymen/.conda/envs/rp-spectrum-cylon',
                              'setup'  : ['/ccs/home/aymen/RADICAL/Cylon/radical.pilot','mpi4py']})

Further, the Summit instructions here for installing mpi4py with Conda show the following:

MPICC="mpicc -shared" pip install --no-binary=mpi4py mpi4py

I do not know how to instruct RP to install mpi4py with these specific flags shown above.
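For illustration, this is the kind of spec I would have expected to work (a sketch only; whether prepare_env runs the pre_exec commands inside the environment being prepared, so that the MPICC flags take effect, is an assumption I have not verified):

    pilot.prepare_env(env_name='ve_raptor',
                      env_spec={'type'    : 'conda',
                                'version' : '3.7',
                                'pre_exec': ['module unload xl xalt',
                                             'module load   gcc/9.1.0',
                                             'module load   python/3.7-anaconda3',
                                             # hypothetical: build mpi4py with the flags from the Summit docs
                                             'MPICC="mpicc -shared" pip install --no-binary=mpi4py mpi4py'],
                                'setup'   : ['/ccs/home/aymen/RADICAL/Cylon/radical.pilot']})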

I get this error under task.exe.c.000000:

task.exe.c.000000/task.exe.c.000000.err:Error: Requested job constraints cannot be met (See /tmp/jsm.batch4.15609/3526570/saved_resources-4_partially_filled for location of partially fullfilled allocation)

RAPTOR master and workers have no logs.

Scenario 2: RAPTOR uses an existing Conda env:

If I instruct RP to use an existing RAPTOR Conda environment with this setup (note that I followed the Summit instructions to install mpi4py):

        pilot.prepare_env(env_name='ve_raptor',
                          env_spec={'type'   : 'conda',
                                    'version': '3.7',
                                    'pre_exec': ["module unload xl xalt ",
                                                 "module load   gcc/9.1.0",
                                                 "module load   python/3.7-anaconda3"],
                                    'path'   : '/ccs/home/aymen/.conda/envs/rp-spectrum-cylon',
                                    'setup'  : []})

RAPTOR and RP tasks hang in the wait state, with no errors in the logs.

Scenario 3: running only MPI executables with RP via 00_getting_started.py:

            td = rp.TaskDescription()
            td.stage_on_error = True
            td.executable     = 'python3'
            td.arguments      = ['/ccs/home/aymen/hello_mpi.py']
            td.cpu_processes  = 10
            td.cpu_process_type = rp.MPI

I get the same error as in scenario 1:

Requested job constraints cannot be met (See /tmp/jsm.batch3.15609/3526656/saved_resources-2_partially_filled for location of partially fullfilled allocation)

Note that, if I use the same environment, the same setup, and the same example, but submit it via:

#!/bin/bash
#BSUB -P CSC449
#BSUB -q debug
#BSUB -W 00:05
#BSUB -nnodes 1
#BSUB -J mpi4py
#BSUB -o mpi4py.%J.out
#BSUB -e mpi4py.%J.err

cd $LSB_OUTDIR
date

module unload xl
module unload xalt
module load   gcc/9.1.0
module load   python/3.7-anaconda3

source activate rp-spectrum-cylon

#jsrun --erf_input temp.rs  python3 hello_mpi.py

jsrun -n1 -r1 -a42 -c42 python3 hello_mpi.py
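# the script above (submit_hello.lsf) is then submitted with: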
bsub -L $SHELL submit_hello.lsf

It works just fine with the following output:

[aymen@login3.summit ~]$ cat mpi4py.2387916.out
Thu Aug 25 12:32:21 EDT 2022
Hello from MPI rank 5 !
Hello from MPI rank 7 !
Hello from MPI rank 19 !
Hello from MPI rank 22 !
Hello from MPI rank 16 !
Hello from MPI rank 8 !
Hello from MPI rank 37 !
Hello from MPI rank 0 !
Hello from MPI rank 33 !
Hello from MPI rank 9 !
Hello from MPI rank 11 !
Hello from MPI rank 40 !
Hello from MPI rank 4 !
Hello from MPI rank 10 !
Hello from MPI rank 3 !
Hello from MPI rank 1 !
Hello from MPI rank 13 !
Hello from MPI rank 32 !
Hello from MPI rank 2 !
Hello from MPI rank 17 !
Hello from MPI rank 6 !
Hello from MPI rank 12 !
Hello from MPI rank 14 !
Hello from MPI rank 15 !
Hello from MPI rank 18 !
Hello from MPI rank 31 !
Hello from MPI rank 21 !
Hello from MPI rank 24 !
Hello from MPI rank 25 !
Hello from MPI rank 35 !
Hello from MPI rank 28 !
Hello from MPI rank 26 !
Hello from MPI rank 23 !
Hello from MPI rank 39 !
Hello from MPI rank 30 !
Hello from MPI rank 27 !
Hello from MPI rank 29 !
Hello from MPI rank 36 !
Hello from MPI rank 38 !
Hello from MPI rank 41 !
Hello from MPI rank 34 !
Hello from MPI rank 20 !

------------------------------------------------------------
Sender: LSF System <lsfadmin@batch4>
Subject: Job 2387916: <mpi4py> in cluster <summit> Done

Job <mpi4py> was submitted from host <login3> by user <aymen> in cluster <summit> at Thu Aug 25 12:29:51 2022
Job was executed on host(s) <1*batch4>, in queue <debug>, as user <aymen> in cluster <summit> at Thu Aug 25 12:32:07 2022
                            <42*b19n14>
</ccs/home/aymen> was used as the home directory.
</ccs/home/aymen> was used as the working directory.
Started at Thu Aug 25 12:32:07 2022
Terminated at Thu Aug 25 12:32:25 2022
Results reported at Thu Aug 25 12:32:25 2022

The output (if any) is above this job summary.

PS:

Read file <mpi4py.2387916.err> for stderr output of this job.

BUT: if I use the jsrun launch command that RP generates for a task within the batch script above:

/opt/ibm/jsm/bin/jsrun --erf_input /gpfs/alpine/scratch/aymen/csc449/radical.pilot.sandbox/rp.session.batch3.aymen.019229.0003/pilot.0000/task.000000//task.000000.rs python3 hello_mpi.py

I get the same error:

Error: Requested job constraints cannot be met (See /tmp/jsm.batch2.15609/3526672/saved_resources-1_partially_filled for location of partially fullfilled allocation)

FYI: these are only 3 scenarios; I will keep updating with more as I test further options.

andre-merzky commented 2 years ago

@AymenFJA : can you please add the following lines at the top of /ccs/home/aymen/hello_mpi.py:

import radical.utils as ru
ru.env_dump(script_path='hello.env')

and attach the resulting env files for all 3 scenarios? Thank you!

AymenFJA commented 2 years ago

@andre-merzky : I tried the following:

import radical.utils as ru
ru.env_dump(script_path='hello.env')

from mpi4py import MPI

comm = MPI.COMM_WORLD      # Use the world communicator
mpi_rank = comm.Get_rank() # The process ID (integer 0-41 for a 42-process job)

print('Hello from MPI rank %s !' %(mpi_rank))

and this:

from mpi4py import MPI
import radical.utils as ru
ru.env_dump(script_path='hello.env')

comm = MPI.COMM_WORLD      # Use the world communicator
mpi_rank = comm.Get_rank() # The process ID (integer 0-41 for a 42-process job)

print('Hello from MPI rank %s !' %(mpi_rank))

I tried all 3 scenarios, and I could not find hello.env anywhere; my assumption is that hello_mpi.py is never even launched or executed. I keep seeing this error:

Error: Requested job constraints cannot be met (See /tmp/jsm.batch2.15609/3526915/saved_resources-11_partially_filled for location of partially fullfilled allocation)

andre-merzky commented 2 years ago

Oh, I see - let me have a look at summit...

AymenFJA commented 2 years ago

@andre-merzky , here is a comment from Summit support regarding the status of jsrun ERF files and a workaround:

Leah Huk commented:
Aymen.
Apparently ERFs have not been working since late last year, since the RHEL 8 OS upgrade. There is a workaround you should try in your job script:
export JSM_ROOT=/gpfs/alpine/stf007/world-shared/vgv/inbox/jsm_erf/jsm-10.4.0.4/opt/ibm/jsm
$JSM_ROOT/bin/jsm &
$JSM_ROOT/bin/jsrun --erf_input=Your_erf ./Your_app
Leah

I tried it, and it did not work. I noticed something in this error:

Error: Requested job constraints cannot be met (See /tmp/jsm.batch5.15609/3534187/saved_resources-1_partially_filled for location of partially fullfilled allocation)

If I navigate to that file, I see that both resource sets point to CPU 0:

Last login: Mon Aug 29 10:34:04 2022 from login3.summit.olcf.ornl.gov
[aymen@batch5.summit ~]$ vi /tmp/jsm.batch5.15609/3534187/saved_resources-1_partially_filled
[aymen@batch5.summit ~]$ cat /tmp/jsm.batch5.15609/3534187/saved_resources-1_partially_filled
RS 0: { host: 1, cpu: 0 }
RS 1: { host: 1, cpu: 0 }
[aymen@batch5.summit ~]$

Although my .rs file states CPU 0 and CPU 1:

cpu_index_using: logical
rank: 0: { host: 1; cpu: {0}}
rank: 1: { host: 1; cpu: {1}}

Is this correct?

andre-merzky commented 2 years ago

Oh boy... Let's discuss this on the call today - but our only (short term / viable) option seems to be to not use ERF files for the time being. Alas, that also means we can't use our scheduler but have to leave placement to LSF, which comes with its own host of potential problems...

mtitov commented 2 years ago

(*) Just a remark: we do have a branch with jsrun fixes (fix/jsrun_summit), and the corresponding env variable should be set in pre_bootstrap_0 in the resource config:

export JSM_ROOT=/gpfs/alpine/stf007/world-shared/vgv/inbox/jsm_erf/jsm-10.4.0.4/opt/ibm/jsm

AymenFJA commented 2 years ago

RP-RAPTOR seems to be partially working with a different LM (MPIRUN); executables work just fine with this description:

            tds.append(rp.TaskDescription({
                'uid'             : 'task.exe.c.%06d' % i,
                'mode'            : rp.TASK_EXECUTABLE,
                'scheduler'       : None,
                'cpu_processes'   : 2,
                'cpu_process_type': rp.MPI,
                'executable'      : '/ccs/home/aymen/.conda/envs/rp-spectrum-cylon/bin/python',
                'arguments'       : ['/ccs/home/aymen/hello_mpi.py']}))

However, RAPTOR workers are not being launched, due to the following error:

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      f31n07
Framework: pml
Component: pami
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[f31n07:384871] *** An error occurred in MPI_Init_thread
[f31n07:384871] *** reported by process [1691222017,7]
[f31n07:384871] *** on a NULL communicator
[f31n07:384871] *** Unknown error
[f31n07:384871] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[f31n07:384871] ***    and potentially your MPI job)
[batch5:2767346] 9 more processes have sent help message help-mca-base.txt / find-available:not-valid
[batch5:2767346] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[batch5:2767346] 6 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[batch5:2767346] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Andre proposed the following: 1- remove the worker description here and replace it with the task description above.

AymenFJA commented 2 years ago

@andre-merzky, finally! I can confirm that the following combination works: 1- using an existing conda env as the RAPTOR env, 2- using MPIRUN as the LM, and 3- tweaking the RAPTOR worker description as follows:

import sys
executable = sys.executable

# 'descr', 'i' and 'fname' are provided by the surrounding raptor worker-submission code
td = rp.TaskDescription(
        {'uid'             : 'task.exe.c.%06d' % i,
        #'named_env'       : descr.get('named_env'),
         'environment'     : descr.get('environment', {}),
         'mode'            : rp.TASK_EXECUTABLE,
         'scheduler'       : None,
         'cpu_processes'   : descr['cpu_processes'],
         'cpu_process_type': rp.MPI,
         'executable'      : executable,
         'arguments'       : ['-c',
                              'import radical.pilot as rp; '
                              "rp.raptor.Worker.run('%s', '%s', '%s')"
                              % (descr.get('worker_file', ''),
                                 descr.get('worker_class', 'DefaultWorker'),
                                 fname)]})

submit 1 pilot(s)
        pilot.0000   ornl.summit_interactive     10 cores       0 gpus        ok
submit: ########################################################################
submit: ########################################################################
wait  : ########################################################################
        DONE      :    10
                                                                              ok
task.exe.c.000000 [DONE]: Hello from MPI rank 1 !
Hello from MPI rank 0 !

task.call.c.1.000000 [DONE]: ['hello: task.call.c.1.000000\n', 'hello: task.call.c.1.000000\n']
task.call.c.2.000000 [DONE]: ['hello 0/2: task.call.c.2.000000\n', 'hello 1/2: task.call.c.2.000000\n']
task.call.c.3.000000 [DONE]: ['\n']
task.mpi_ser_func.c.000000 [DONE]: ['hello 1/2: task.call.c.000000\n', 'hello 0/2: task.call.c.000000\n']
task.ser_func.c.000000 [DONE]: ['func_non_mpi\n', 'func_non_mpi\n']
task.eval.c.000000 [DONE]: ['hello 1/2: task.eval.c.000000\n', 'hello 0/2: task.eval.c.000000\n']
task.exec.c.000000 [DONE]: ['hello 0/2: task.exec.c.000000\n', 'hello 1/2: task.exec.c.000000\n']
task.proc.c.000000 [DONE]: ['hello 0/2: task.proc.c.000000\n', 'hello 1/2: task.proc.c.000000\n']
task.shell.c.000000 [DONE]: ['hello 1/2: task.shell.c.000000\n', 'hello 0/2: task.shell.c.000000\n']
closing session rp.session.batch1.aymen.019235.0007                            \

The question now is: how should we proceed?

andre-merzky commented 2 years ago

Woah, mpirun now works on summit? That's good news indeed!

The question now is: how should we proceed?

Honestly, replacing jsrun with mpirun as default launch method for ornl.summit seems like a reasonable option then... @mtitov: any opinion?

mtitov commented 2 years ago

I would think we can have another configuration for Summit with the MPIRun LM. I lean towards keeping JSRun as the default LM in ornl.summit and having the MPIRun LM in ornl.summit_mpirun, but I have no strong opinion about which LM should be in ornl.summit.

BTW @AymenFJA , did you also check that with a non-interactive run?

p.s. maybe we can also reconsider the possibility of choosing the LM in PilotDescription: by default, use the first applicable LM from the config (respecting the defined order), and also let the user pick a particular LM from a provided list of LMs
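For illustration only, a hypothetical sketch of what that could look like (the launch_methods attribute does not exist today, and ornl.summit_mpirun is just the config name proposed above):

    import radical.pilot as rp

    # today: the launch method is implied by the resource config that is selected
    pd = rp.PilotDescription({'resource': 'ornl.summit_mpirun',  # proposed mpirun-based config
                              'runtime' : 30,
                              'cores'   : 42})

    # hypothetical extension from the p.s. above (no such attribute exists yet):
    #   pd.launch_methods = ['MPIRUN', 'JSRUN']   # ordered list of acceptable LMs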

AymenFJA commented 2 years ago

@mtitov , fundamentally it should. But to confirm: yes, I did try it, and it worked from the login node:

(rp-spectrum-cylon) [aymen@login3.summit raptor_login_node]$ python raptor.py
new session: [rp.session.login3.aymen.019235.0010]                             \
database   : [mongodb://aymen:****@95.217.193.116:27017/radical3]             ok
create pilot manager                                                          ok
create task manager                                                           ok
submit 1 pilot(s)
        pilot.0000   ornl.summit              10 cores       0 gpus           ok
submit: ########################################################################
submit: ########################################################################
wait  : ########################################################################
        DONE      :    10
                                                                              ok
task.exe.c.000000 [DONE]: Hello from MPI rank 1 !
Hello from MPI rank 0 !

task.call.c.1.000000 [DONE]: ['hello: task.call.c.1.000000\n', 'hello: task.call.c.1.000000\n']
task.call.c.2.000000 [DONE]: ['hello 1/2: task.call.c.2.000000\n', 'hello 0/2: task.call.c.2.000000\n']
task.call.c.3.000000 [DONE]: ['\n']
task.mpi_ser_func.c.000000 [DONE]: ['hello 1/2: task.call.c.000000\n', 'hello 0/2: task.call.c.000000\n']
task.ser_func.c.000000 [DONE]: ['func_non_mpi\n', 'func_non_mpi\n']
task.eval.c.000000 [DONE]: ['hello 0/2: task.eval.c.000000\n', 'hello 1/2: task.eval.c.000000\n']
task.exec.c.000000 [DONE]: ['hello 1/2: task.exec.c.000000\n', 'hello 0/2: task.exec.c.000000\n']
task.proc.c.000000 [DONE]: ['hello 1/2: task.proc.c.000000\n', 'hello 0/2: task.proc.c.000000\n']
task.shell.c.000000 [DONE]: ['hello 0/2: task.shell.c.000000\n', 'hello 1/2: task.shell.c.000000\n']
closing session rp.session.login3.aymen.019235.0010                            \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ rp.session.login3.aymen.019235.0010 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 145.3s                                                      ok
(rp-spectrum-cylon) [aymen@login3.summit raptor_login_node]$

Also, FYI and to consider regarding the next step (from Summit support):

Leah Huk commented:
Hi Aymen,
I'm sorry to hear the workaround didn't help. Unfortunately, we don't have
a solution right now for use with .erf files. It is a problem that will not be 
fixed until the next OS upgrade, and we don't have a timeline for when 
that will occur (our staff are focused on the Frontier launch).
Leah

andre-merzky commented 2 years ago

p.s. maybe we can also reconsider the possibility of choosing the LM in PilotDescription: by default, use the first applicable LM from the config (respecting the defined order), and also let the user pick a particular LM from a provided list of LMs

Maybe - but that is out of scope for this specific ticket right now :-)

mpirun vs. jsrun: if jsrun is broken then it should not be the default config - but ack on keeping the config around. Maybe rename it to ornl.summit_jsrun and make the default one for mpirun then?

mtitov commented 2 years ago

Leah Huk commented: ... Unfortunately, we don't have a solution right now for use with .erf files.

Maybe rename it to ornl.summit_jsrun and make the default one for mpirun then?

Agree! (also considering that extra confirmation on ERF files... :) )

AymenFJA commented 2 years ago

@andre-merzky , @mtitov I agree as well. One more thing: what about the raptor_worker description, should we adapt it to the new changes? Two things were updated in the description above:

1- named_env is problematic, i.e., mpirun will stumble with the error above (It looks like MPI_INIT failed for some reason...) unless we find a workaround.
2- Get the entire Python executable path and feed it to the worker description instead of just python or python3.

andre-merzky commented 2 years ago

1- named_env is problematic, i.e., mpirun will stumble with the error above (It looks like MPI_INIT failed for some reason...) unless we find a workaround.

Note that module load python comes with its own deployment of mpi4py, so you actually don't need to install it in the named_env. That should hopefully resolve the MPI_Init problems - can you give it a try?
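That would be a sketch along the lines of the scenario 1 spec above, with mpi4py dropped from setup (untested, just to illustrate the suggestion):

    pilot.prepare_env(env_name='ve_raptor',
                      env_spec={'type'    : 'conda',
                                'version' : '3.7',
                                'pre_exec': ['module unload xl xalt',
                                             'module load   gcc/9.1.0',
                                             'module load   python/3.7-anaconda3'],
                                # no 'mpi4py' here - rely on the mpi4py shipped with the python module
                                'setup'   : ['/ccs/home/aymen/RADICAL/Cylon/radical.pilot']})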

2- Get the entire Python executable path and feed it to the worker description instead of just python or python3.

sys.executable will not work if a named_env is used.

Either way, having said all that: it should be OK if we document that, on Summit, named_env is not supported for raptor master and worker tasks. Would you agree?

AymenFJA commented 2 years ago

Update: I checked where the Python executable is coming from, and it is coming from the raptor env. I still see the same issue: if I uncomment named_env, MPI fails. Note that sys.executable is no longer needed.

andre-merzky commented 2 years ago

Please add to pre_exec: python3 -c 'import mpi4py; print(mpi4py.__file__)'
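For example, a sketch assuming the worker task description (td) from above:

    td.pre_exec = ["python3 -c 'import mpi4py; print(mpi4py.__file__)'"]

That way the task output shows which mpi4py deployment is actually picked up.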

AymenFJA commented 2 years ago

I cannot access Summit because my account is being renewed; the help desk mentioned it will take up to 4 weeks.

AymenFJA commented 1 year ago

Closing this as it is outdated, in favor of #2827.