@AymenFJA : can you please add the following lines on top of /ccs/home/aymen/hello_mpi.py:
import radical.utils as ru
ru.env_dump(script_path='hello.env')
and attach the resulting env files for all 3 scenarios? Thank you!
@andre-merzky : I tried the following:
import radical.utils as ru
ru.env_dump(script_path='hello.env')
from mpi4py import MPI
comm = MPI.COMM_WORLD # Use the world communicator
mpi_rank = comm.Get_rank() # The process ID (integer 0-41 for a 42-process job)
print('Hello from MPI rank %s !' %(mpi_rank))
and this:
from mpi4py import MPI
import radical.utils as ru
ru.env_dump(script_path='hello.env')
comm = MPI.COMM_WORLD # Use the world communicator
mpi_rank = comm.Get_rank() # The process ID (integer 0-41 for a 42-process job)
print('Hello from MPI rank %s !' %(mpi_rank))
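For completeness, a variant that dumps one env file per process (assuming ru.env_dump accepts script_path as shown above), so that concurrent ranks do not overwrite each other's file:
import os
import radical.utils as ru

# dump each process' environment into its own file (PID-based name), so that
# concurrent MPI ranks do not overwrite each other's dump
ru.env_dump(script_path='hello.%d.env' % os.getpid())

from mpi4py import MPI
comm     = MPI.COMM_WORLD
mpi_rank = comm.Get_rank()
print('Hello from MPI rank %s !' % (mpi_rank))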
I tried 3 scenarios, and I could not find hello.env anywhere; my assumption is that hello_mpi.py is never launched or executed. I keep seeing this error:
Error: Requested job constraints cannot be met (See /tmp/jsm.batch2.15609/3526915/saved_resources-11_partially_filled for location of partially fullfilled allocation)
Oh, I see - let me have a look at summit...
@andre-merzky , here is a comment from Summit support regarding the status of jsrun ERFs, and a workaround:
Leah Huk commented:
Aymen.
Apparently ERFs have not been working since late last year, since the RHEL 8 OS upgrade. There is a workaround you should try in your job script:
export JSM_ROOT=/gpfs/alpine/stf007/world-shared/vgv/inbox/jsm_erf/jsm-10.4.0.4/opt/ibm/jsm
$JSM_ROOT/bin/jsm &
$JSM_ROOT/bin/jsrun --erf_input=Your_erf ./Your_app
Leah
I tried it, and it did not work.
I noticed something in this error: Error: Requested job constraints cannot be met (See /tmp/jsm.batch5.15609/3534187/saved_resources-1_partially_filled for location of partially fullfilled allocation)
If I look at that file, I see that both resource sets point to cpu: 0:
[aymen@batch5.summit ~]$ vi /tmp/jsm.batch5.15609/3534187/saved_resources-1_partially_filled
[aymen@batch5.summit ~]$ cat /tmp/jsm.batch5.15609/3534187/saved_resources-1_partially_filled
RS 0: { host: 1, cpu: 0 }
RS 1: { host: 1, cpu: 0 }
[aymen@batch5.summit ~]$
Although my RS file states CPU 0 and CPU 1:
cpu_index_using: logical
rank: 0: { host: 1; cpu: {0}}
rank: 1: { host: 1; cpu: {1}}
Is this correct?
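For reference, a tiny sketch of how an RS file in that format could be generated for N ranks, one logical CPU per rank (purely illustrative, not RP's actual ERF writer):
def write_rs_file(path, n_ranks, host=1):
    # write a resource-set file in the format shown above:
    # one rank per line, one logical cpu per rank on the given host
    with open(path, 'w') as fout:
        fout.write('cpu_index_using: logical\n')
        for rank in range(n_ranks):
            fout.write('rank: %d: { host: %d; cpu: {%d}}\n' % (rank, host, rank))

write_rs_file('hello.rs', n_ranks=2)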
Oh boy... Let's discuss this on the call today - but our only (short term / viable) option seems to be not to use ERF files for the time being. Alas, that also means we can't use our scheduler but have to leave placement to LSF, which comes with its own host of potential problems...
(*) Just a remark: we do have a branch with jsrun fixes (fix/jsrun_summit), and a corresponding environment variable should be set in pre_bootstrap_0 in the resource config: export JSM_ROOT=/gpfs/alpine/stf007/world-shared/vgv/inbox/jsm_erf/jsm-10.4.0.4/opt/ibm/jsm
RP-RAPTOR seems to be partially working with a different LM (MPIRUN): executables are working just fine with this description:
tds.append(rp.TaskDescription({
    'uid'             : 'task.exe.c.%06d' % i,
    'mode'            : rp.TASK_EXECUTABLE,
    'scheduler'       : None,
    'cpu_processes'   : 2,
    'cpu_process_type': rp.MPI,
    'executable'      : '/ccs/home/aymen/.conda/envs/rp-spectrum-cylon/bin/python',
    'arguments'       : ['/ccs/home/aymen/hello_mpi.py']}))
Meanwhile, RAPTOR workers are not being launched, due to the following error:
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: f31n07
Framework: pml
Component: pami
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[f31n07:384871] *** An error occurred in MPI_Init_thread
[f31n07:384871] *** reported by process [1691222017,7]
[f31n07:384871] *** on a NULL communicator
[f31n07:384871] *** Unknown error
[f31n07:384871] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[f31n07:384871] *** and potentially your MPI job)
[batch5:2767346] 9 more processes have sent help message help-mca-base.txt / find-available:not-valid
[batch5:2767346] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[batch5:2767346] 6 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[batch5:2767346] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Andre proposed the following: 1. remove the worker description here and replace it with the task description above.
@andre-merzky, finally! I can confirm that:
1. using an existing conda env as the raptor env, and
2. using MPIRUN as the LM, and
3. tweaking the raptor worker description as follows
works:
import sys
executable = sys.executable

td = rp.TaskDescription(
        {'uid'             : 'task.exe.c.%06d' % i,
       # 'named_env'       : descr.get('named_env'),
         'environment'     : descr.get('environment', {}),
         'mode'            : rp.TASK_EXECUTABLE,
         'scheduler'       : None,
         'cpu_processes'   : descr['cpu_processes'],
         'cpu_process_type': rp.MPI,
         'executable'      : executable,
         'arguments'       : [
             '-c',
             'import radical.pilot as rp; '
             "rp.raptor.Worker.run('%s', '%s', '%s')"
             % (descr.get('worker_file', ''),
                descr.get('worker_class', 'DefaultWorker'),
                fname)]})
submit 1 pilot(s)
pilot.0000 ornl.summit_interactive 10 cores 0 gpus ok
submit: ########################################################################
submit: ########################################################################
wait : ########################################################################
DONE : 10
ok
task.exe.c.000000 [DONE]: Hello from MPI rank 1 !
Hello from MPI rank 0 !
task.call.c.1.000000 [DONE]: ['hello: task.call.c.1.000000\n', 'hello: task.call.c.1.000000\n']
task.call.c.2.000000 [DONE]: ['hello 0/2: task.call.c.2.000000\n', 'hello 1/2: task.call.c.2.000000\n']
task.call.c.3.000000 [DONE]: ['\n']
task.mpi_ser_func.c.000000 [DONE]: ['hello 1/2: task.call.c.000000\n', 'hello 0/2: task.call.c.000000\n']
task.ser_func.c.000000 [DONE]: ['func_non_mpi\n', 'func_non_mpi\n']
task.eval.c.000000 [DONE]: ['hello 1/2: task.eval.c.000000\n', 'hello 0/2: task.eval.c.000000\n']
task.exec.c.000000 [DONE]: ['hello 0/2: task.exec.c.000000\n', 'hello 1/2: task.exec.c.000000\n']
task.proc.c.000000 [DONE]: ['hello 0/2: task.proc.c.000000\n', 'hello 1/2: task.proc.c.000000\n']
task.shell.c.000000 [DONE]: ['hello 1/2: task.shell.c.000000\n', 'hello 0/2: task.shell.c.000000\n']
closing session rp.session.batch1.aymen.019235.0007 \
The question now is: how should we proceed?
Woah, mpirun now works on Summit? That's good news indeed!
The question now is: how should we proceed?
Honestly, replacing jsrun with mpirun as the default launch method for ornl.summit seems like a reasonable option then... @mtitov: any opinion?
I would think we can have another configuration for Summit with the MPIRun LM. I lean towards keeping JSRun as the default LM in ornl.summit and having the MPIRun LM in ornl.summit_mpirun, but I have no strong opinion about which LM should be in ornl.summit.
BTW @AymenFJA , did you also check that with a non-interactive run?
P.S. Maybe we can also reconsider the possibility of choosing the LM in the PilotDescription: by default use the first applicable LM from the config (respecting the defined order), but also let the user pick a particular LM from a provided list of LMs.
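A rough sketch of that idea (the launch_method attribute is hypothetical, it does not exist in the current PilotDescription; the other values are taken from this ticket):
import radical.pilot as rp

pd = rp.PilotDescription({'resource': 'ornl.summit',
                          'cores'   : 10,
                          'runtime' : 30})
# proposed (hypothetical) knob: let the user pick one of the LMs offered by
# the resource config instead of always taking the first applicable one
# pd.launch_method = 'MPIRUN'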
@mtitov , fundamentally it should work. But to confirm: yes, I did try it, and it worked from the login node:
(rp-spectrum-cylon) [aymen@login3.summit raptor_login_node]$ python raptor.py
new session: [rp.session.login3.aymen.019235.0010] \
database : [mongodb://aymen:****@95.217.193.116:27017/radical3] ok
create pilot manager ok
create task manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 10 cores 0 gpus ok
submit: ########################################################################
submit: ########################################################################
wait : ########################################################################
DONE : 10
ok
task.exe.c.000000 [DONE]: Hello from MPI rank 1 !
Hello from MPI rank 0 !
task.call.c.1.000000 [DONE]: ['hello: task.call.c.1.000000\n', 'hello: task.call.c.1.000000\n']
task.call.c.2.000000 [DONE]: ['hello 1/2: task.call.c.2.000000\n', 'hello 0/2: task.call.c.2.000000\n']
task.call.c.3.000000 [DONE]: ['\n']
task.mpi_ser_func.c.000000 [DONE]: ['hello 1/2: task.call.c.000000\n', 'hello 0/2: task.call.c.000000\n']
task.ser_func.c.000000 [DONE]: ['func_non_mpi\n', 'func_non_mpi\n']
task.eval.c.000000 [DONE]: ['hello 0/2: task.eval.c.000000\n', 'hello 1/2: task.eval.c.000000\n']
task.exec.c.000000 [DONE]: ['hello 1/2: task.exec.c.000000\n', 'hello 0/2: task.exec.c.000000\n']
task.proc.c.000000 [DONE]: ['hello 1/2: task.proc.c.000000\n', 'hello 0/2: task.proc.c.000000\n']
task.shell.c.000000 [DONE]: ['hello 0/2: task.shell.c.000000\n', 'hello 1/2: task.shell.c.000000\n']
closing session rp.session.login3.aymen.019235.0010 \
close task manager ok
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
+ rp.session.login3.aymen.019235.0010 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 145.3s ok
(rp-spectrum-cylon) [aymen@login3.summit raptor_login_node]$
Also, FYI, and something to consider regarding the next step (from Summit support):
Leah Huk commented:
Hi Aymen,
I'm sorry to hear the workaround didn't help. Unfortunately, we don't have
a solution right now for use with .erf files. It is a problem that will not be
fixed until the next OS upgrade, and we don't have a timeline for when
that will occur (our staff are focused on the Frontier launch).
Leah
P.S. Maybe we can also reconsider the possibility of choosing the LM in the PilotDescription: by default use the first applicable LM from the config (respecting the defined order), but also let the user pick a particular LM from a provided list of LMs.
Maybe - but that is out of scope for this specific ticket right now :-)
mpirun vs. jsrun: if jsrun is broken, then it should not be the default config - but ack on keeping the config around. Maybe rename it to ornl.summit_jsrun and make the default one use mpirun then?
Leah Huk commented: ... Unfortunately, we don't have a solution right now for use with .erf files.
Maybe rename it to ornl.summit_jsrun and make the default one use mpirun then?
Agree! (also considering that extra confirmation on ERF files... :) )
@andre-merzky , @mtitov I agree as well. One more thing: what about the raptor_worker description - should we adapt it to the new changes? Two things got updated in the description above:
1. named_env is problematic, i.e. mpirun will stumble with the error above ("It looks like MPI_INIT failed for some reason..."), unless we find a workaround.
2. get the entire Python executable path and feed it to the worker description, instead of just python or python3.
1. named_env is problematic, i.e. mpirun will stumble with the error above ("It looks like MPI_INIT failed for some reason..."), unless we find a workaround.
Note that module load python comes with its own deployment of mpi4py, so you actually don't need to install it in the named_env. That should hopefully resolve the MPI_Init problems - can you give it a try?
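A sketch of that suggestion, reusing the values from this ticket and assuming the python module is available on the compute nodes:
import radical.pilot as rp

td = rp.TaskDescription({
    'mode'            : rp.TASK_EXECUTABLE,
    # rely on Summit's python module (which ships its own mpi4py) instead of
    # installing mpi4py into a named_env
    'pre_exec'        : ['module load python'],
    'executable'      : 'python3',
    'arguments'       : ['/ccs/home/aymen/hello_mpi.py'],
    'cpu_processes'   : 2,
    'cpu_process_type': rp.MPI})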
2. get the entire Python executable path and feed it to the worker description, instead of just python or python3.
sys.executable will not work if a named_env is used.
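To spell out the concern (illustrative only):
import sys

# sys.executable is resolved in the process that *builds* the description
# (client or raptor master), not inside the named_env the task later runs in
print(sys.executable)   # the submitting process' interpreter
# with a named_env, a plain 'python3' executable is resolved at task runtime
# inside that env, which is what one actually wants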
Either way, having said all that: it should be ok if we document that, on Summit, named_env is not supported for raptor master and worker tasks. Would you agree?
Update: I checked where the Python executable is coming from, and it is coming from raptor_env. I still see the same issue: if I uncomment named_env, MPI fails. Note that sys.executable is no longer needed.
Please add to pre_exec: python3 -c 'import mpi4py; print(mpi4py.__file__)'
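For concreteness, a minimal sketch of adding that check to a task description's pre_exec (assuming the description is built as above):
import radical.pilot as rp

td = rp.TaskDescription()
# log which mpi4py deployment the task actually picks up at runtime
td.pre_exec = ["python3 -c 'import mpi4py; print(mpi4py.__file__)'"]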
I cannot access Summit because my account is being renewed; it will take up to 4 weeks, as the help desk mentioned.
Closing this as it is outdated and in favor of #2827.
This ticket is related to our new use case, Cylon; it is blocking us from starting an initial scaling test on Summit. As a first step, we are trying to run raptor.py (executables and functions, MPI and non-MPI) on Summit from the login node / interactively with jsrun. This fails in multiple scenarios:
Scenario 1: RP creates a new Conda env for RAPTOR:
If I instruct RP to create the RAPTOR environment with Conda with this setup:
Further, the Summit instructions here to install mpi4py with Conda show the following:
I do not know how to instruct RP to install mpi4py with these specific flags shown above.
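For context, a hedged sketch of how a RAPTOR environment is typically requested (assuming the Pilot.prepare_env() call; 'raptor_env' is a placeholder name) - the setup list takes plain package names, which is exactly why it is unclear where Summit's special build flags would go:
# 'raptor_env' is a placeholder name; 'setup' takes package names, with no
# obvious place for Summit's mpi4py-specific build flags
pilot.prepare_env('raptor_env', {'type' : 'conda',
                                 'setup': ['mpi4py']})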
I get this error under task.exe.c.000000:
RAPTOR master and workers have no logs.
Scenario 2: RAPTOR uses an existing Conda env:
If I instruct RP to use an existing RAPTOR environment with Conda with this setup (note that I followed the Summit instructions to install mpi4py):
RAPTOR and RP tasks hang on wait with no error in logs.
Scenario 3: run only MPI executables with RP via 00_getting_started.py:
I get the same error as in scenario 1:
Note that, if I use the same environment and the same setup and example, but submit it via:
It works just fine with the following output:
BUT: if I use the same RP launching script for any task with the batch script above:
I get the same error:
FYI: these are only 3 scenarios; I will keep updating with more as I test more options.