Tested with:

```
$ radical-stack

  python       : /home/mei2n/.virtualenvs/py3_11/bin/python3
  pythonpath   : /apps/software/standard/mpi/gcc/11.2.0/openmpi/4.1.4/python/3.11.1/easybuild/python
  version      : 3.11.1
  virtualenv   : /home/mei2n/.virtualenvs/py3_11

  radical.gtod : 1.20.1
  radical.pilot: 1.22.0-v1.4.0-4449-gf130471@2835-rivanna
  radical.saga : 1.22.0-v1.21.0-5-gb73c644@devel
  radical.utils: 1.22.0-v1.21.0-6-g38abe69@devel
```
All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 40.95%. Comparing base (fe20ee9) to head (c05eb41). Report is 3233 commits behind head on devel.
The `--ntasks-per-node 40` that gets generated by the new radical.saga may be causing trouble. It might be appropriate to update that logic with something like `min(cores, 40)`. Test jobs indicate that the `--ntasks-per-node` is superseding the `--ntasks` part of the request with `salloc` (but seemingly not with `srun`).
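For illustration only, a minimal sketch of that clamping (this is not the actual radical.saga SLURM adaptor code; the function name and the 40-core node width are assumptions):

```python
# Hypothetical sketch of the suggested fix, not the radical.saga SLURM
# adaptor itself: clamp the per-node task count to the requested core
# count so that a small pilot does not inherit the full node width.
CORES_PER_NODE = 40   # assumed width of a Rivanna standard node

def slurm_task_options(requested_cores: int) -> list:
    ntasks_per_node = min(requested_cores, CORES_PER_NODE)
    return ['--ntasks=%d'          % requested_cores,
            '--ntasks-per-node=%d' % ntasks_per_node]

# For the 8-core pilot discussed below, this yields
# ['--ntasks=8', '--ntasks-per-node=8'] rather than '--ntasks-per-node=40'.
print(slurm_task_options(8))
```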
> The `--ntasks-per-node 40` that gets generated by the new radical.saga may be causing trouble. It might be appropriate to update that logic with something like `min(cores, 40)`. Test jobs indicate that the `--ntasks-per-node` is superseding the `--ntasks` part of the request with `salloc` (but seemingly not with `srun`).
I don't think this observation is relevant. Taking a closer look at the queued job:
```
$ scontrol show job 47818071
JobId=47818071 JobName=pilot.0000
   UserId=mei2n(860833) GroupId=users(100) MCS_label=N/A
   Priority=798174 Nice=0 Account=kas_dev QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:15:00 TimeMin=N/A
   SubmitTime=2023-03-20T08:00:37 EligibleTime=2023-03-20T08:00:37
   AccrueTime=2023-03-20T08:00:37
   StartTime=2023-03-21T22:03:11 EndTime=2023-03-21T22:18:11 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-20T08:52:08 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=udc-ba36-36:21622
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=72000M,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=9000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000/tmp_abk2cxmu.slurm
   WorkDir=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000/
   StdErr=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000//bootstrap_0.err
   StdIn=/dev/null
   StdOut=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000//bootstrap_0.out
   Power=
```
It isn't clear to me which resource is stuck.
```
$ squeue -j 47818071 --start
     JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
  47818071  standard pilot.00    mei2n PD 2023-03-21T22:03:11      1 (null)               (Priority)
```
@eirrgang Eric, just to conclude here, will `interactive` mode be added? And after that, the PR should be ready for merging.
> @eirrgang Eric, just to conclude here, will `interactive` mode be added? And after that, the PR should be ready for merging.
I was going to let @AymenFJA take a look at the archive from a test run: https://github.com/radical-cybertools/radical.pilot/pull/2855/files#r1144929336
I should also re-test with saga 1.22.
I'll go ahead and try that right now...
We can also add "interactive" in a follow-up when we're ready.
@eirrgang checking your session files, I could not find any error on any level. I am assuming the MPI problem mentioned above on the task level is solved, or am I missing something?
@AymenFJA wrote:

> @eirrgang checking your session files, I could not find any error on any level. I am assuming the MPI problem mentioned above on the task level is solved, or am I missing something?
Errors persist in "interactive" mode:
```
$ ./09_mpi_tasks.py uva.rivanna
================================================================================
 Getting Started (RP version 1.22.0)
================================================================================
new session: [rp.session.udc-aw29-23a.mei2n.019438.0005]                       \
database   : [mongodb://eirrgang:****@95.217.193.116/scalems]                 ok
read config                                                                   ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   uva.rivanna              8 cores       0 gpus            ok
--------------------------------------------------------------------------------
submit tasks
create task manager                                                           ok
create 2 task description(s)
        ..                                                                    ok
submit: ########################################################################
--------------------------------------------------------------------------------
gather results
wait   : ########################################################################
        DONE      :     2
                                                                              ok
  * task.000000: DONE, exit: 0, ranks: udc-aw29-23a 0:0/2 @ 0/40 : 0/1
udc-aw29-23a 0:0/2 @ 0/40 : 0/1
udc-aw29-23a 0:0/2 @ 0/40 : 0/1
udc-aw29-23a 0:1/2 @ 0/40 : 0/1
udc-aw29-23a 0:1/2 @ 0/40 : 0/1
udc-aw29-23a 0:1/2 @ 0/40 : 0/1
caught Exception: missing rank 1:0/2 (['0:0/2', '0:0/2', '0:0/2', '0:1/2', '0:1/2', '0:1/2'])
--------------------------------------------------------------------------------
finalize
closing session rp.session.udc-aw29-23a.mei2n.019438.0005                      \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                          timeout
                                                                              ok
session lifetime: 90.8s                                                       ok
Traceback (most recent call last):
  File "/sfs/qumulo/qhome/mei2n/projects/radical.pilot/examples/./09_mpi_tasks.py", line 127, in <module>
    assert rank in ranks, 'missing rank %s (%s)' % (rank, ranks)
           ^^^^^^^^^^^^^
AssertionError: missing rank 1:0/2 (['0:0/2', '0:0/2', '0:0/2', '0:1/2', '0:1/2', '0:1/2'])
```
This is the same error that was encountered in the job associated with the previously linked archive. The tasks appear to succeed, but something about this invocation leads to different values than expected.
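For reference, a hypothetical reconstruction of the check that fails (based only on the assertion shown in the traceback; the expected-rank list is an assumption, not the actual `09_mpi_tasks.py` code):

```python
# Hypothetical reconstruction of the failing check, not the real example
# code: the example expects every '<i>:<j>/<n>' rank identifier to appear
# among the identifiers parsed from the task output.
expected = ['0:0/2', '0:1/2', '1:0/2', '1:1/2']          # assumed expectation
ranks    = ['0:0/2', '0:0/2', '0:0/2',
            '0:1/2', '0:1/2', '0:1/2']                   # observed above

for rank in expected:
    assert rank in ranks, 'missing rank %s (%s)' % (rank, ranks)
# Raises AssertionError at '1:0/2', matching the traceback: only the
# '0:*' identifiers ever show up in the collected output.
```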
If this works as expected for @eirrgang and @AymenFJA, I would merge it.
The `local` mode worked fine. `interactive` failed as described. `ssh` was not fully tested, since something about the generated job request was taking forever to get through the queue.
Thanks, @eirrgang. Currently, I am waiting for my Rivanna request to be approved so I can get access to it. Once I have access, I will test the `interactive` and `ssh` modes.
Sorry... I think I accidentally deleted the source branch before this was resolved. Is this still the config we're going for?
@eirrgang I was about to open a PR with the config file. But since you reopened this PR, would it be possible to update the config with this one (this is the one that I used in my test earlier):
```json
{
    "rivanna":
    {
        "description"            : "Heterogeneous community-model Linux cluster",
        "notes"                  : "Access from registered UVA IP address. See https://www.rc.virginia.edu/userinfo/rivanna/login/",
        "schemas"                : ["local", "ssh", "interactive"],
        "local"                  :
        {
            "job_manager_endpoint" : "slurm://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint"  : "file://rivanna.hpc.virginia.edu/"
        },
        "ssh"                    :
        {
            "job_manager_endpoint" : "slurm+ssh://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint"  : "sftp://rivanna.hpc.virginia.edu/"
        },
        "interactive"            :
        {
            "job_manager_endpoint" : "fork://localhost/",
            "filesystem_endpoint"  : "file://localhost/"
        },
        "default_queue"          : "standard",
        "resource_manager"       : "SLURM",
        "agent_scheduler"        : "CONTINUOUS",
        "agent_spawner"          : "POPEN",
        "launch_methods"         : {
            "order"  : ["MPIRUN"],
            "MPIRUN" : {}
        },
        "pre_bootstrap_0"        : [
            "module load gcc/11.2.0",
            "module load openmpi/4.1.4",
            "module load python/3.11.1"
        ],
        "default_remote_workdir" : "/scratch/$USER",
        "python_dist"            : "default",
        "virtenv_dist"           : "default",
        "virtenv_mode"           : "create",
        "rp_version"             : "local"
    }
}
```
Almost the same, but with an `interactive` entry.
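As a quick sanity check that the entry is picked up, here is a sketch under assumptions: the JSON above is installed as the `rivanna` key of a `resource_uva.json` that radical.pilot can find (e.g. a user-level config under `~/.radical/pilot/configs/`), and a MongoDB URL is configured for the session.

```python
# Sketch only: confirm that radical.pilot resolves the 'uva.rivanna'
# resource label to the config above (Session creation assumes a working
# RADICAL_PILOT_DBURL, as required by RP 1.x).
import radical.pilot as rp

session = rp.Session()
try:
    cfg = session.get_resource_config('uva.rivanna')
    print(cfg['default_queue'])   # expected: 'standard'
    print(cfg['schemas'])         # expected: ['local', 'ssh', 'interactive']
finally:
    session.close()
```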
> @eirrgang I was about to open a PR with the config file. But since you reopened this PR, would it be possible to update the config with this one (this is the one that I used in my test earlier)

Are there updates in `devel` or elsewhere that should be merged to resolve the problems with `09_mpi_tasks.py` reported above?
No, I am testing the MPI to see if your issue above is reproducible. I will update the PR.
> No, I am testing the MPI to see if your issue above is reproducible. I will update the PR.
I am not well acquainted with the RP "interactive" sessions. I may have chosen job parameters poorly, or something. Please note the slurm commands you use, if you are successful.
To use the `interactive` mode you have two ways, but with the same pilot description:

```python
pd_init = {'resource'      : 'uva.rivanna',
           'runtime'       : 30,    # pilot runtime (min)
           'exit_on_error' : True,
           'access_schema' : 'interactive',
           'cores'         : 1,
           'gpus'          : 0
          }
```

- Ask for an interactive node: `ijob -p dev -t 00:30:00`, and run your rp example with the pilot description above (`python 00_getting_started.py`).
- Run `00_getting_started.py` as an sbatch job, using the same pilot description above: just create an `rp_sbatch.sh` file or so and call `00_getting_started.py`; below is the one that I used:
```
(rct) -bash-4.2$ cat rp_sbatch.sh
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=4
#SBATCH -J "rp_sbatch"
#SBATCH --output "rp_sbatch_0.out"
#SBATCH --error "rp_sbatch_0.err"
#SBATCH --account "YOUR PROJECT ID"
#SBATCH --partition "dev"
#SBATCH --time 00:30:00

export RADICAL_LOG_LVL="DEBUG"
export RADICAL_PROFILE="TRUE"
export RADICAL_PILOT_DBURL=mongodb://aymen:XXXXXX@95.217.193.116:27017/radical3

source $HOME/ve/rct/bin/activate

python 00_getting_started.py
```
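For completeness, a minimal sketch of the end-to-end flow around that pilot description (a hedged approximation of the usual RP 1.x pattern, not the exact `00_getting_started.py` code; it assumes a reachable MongoDB set in `RADICAL_PILOT_DBURL`):

```python
# Minimal sketch (not the exact 00_getting_started.py): submit one pilot
# with the 'interactive' access schema and run a trivial task on it.
import radical.pilot as rp

pd_init = {'resource'      : 'uva.rivanna',
           'runtime'       : 30,
           'exit_on_error' : True,
           'access_schema' : 'interactive',
           'cores'         : 1,
           'gpus'          : 0}

session = rp.Session()                      # needs RADICAL_PILOT_DBURL in RP 1.x
try:
    pmgr  = rp.PilotManager(session=session)
    tmgr  = rp.TaskManager(session=session)

    pilot = pmgr.submit_pilots(rp.PilotDescription(pd_init))
    tmgr.add_pilots(pilot)

    td            = rp.TaskDescription()
    td.executable = '/bin/date'             # placeholder workload
    tasks         = tmgr.submit_tasks([td])

    tmgr.wait_tasks()
    for task in tasks:
        print(task.uid, task.state, task.stdout)
finally:
    session.close(download=True)
```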
It is specifically the MPI case that concerns me, assuming interactive mode is expected to support MPI / multiprocess workloads.
@eirrgang : we are going to merge this before the MPI is tested again (queue times are slow right now). The MPI functionality is somewhat orthogonal to the configuration in this PR, and we want this PR to go into the release we are pushing out these days.
Ref #2835