Closed: AymenFJA closed this issue 1 year ago.
I am a Rivanna user and am interested in this issue.
If anyone has a resource definition that they have used, could you please share it with me as a starting point?
Note: I do not have access to the issue tracker linked in the other_activities project.
In an email thread, the following resource_uva.json file was shared:
{
    "rivanna": {
        "description"            : "Heterogeneous community-model Linux cluster",
        "notes"                  : "Access from registered IP address",
        "schemas"                : ["local", "ssh", "interactive"],
        "local"                  : {
            "job_manager_endpoint": "slurm://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint" : "file://rivanna.hpc.virginia.edu/"
        },
        "ssh"                    : {
            "job_manager_endpoint": "slurm+ssh://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint" : "sftp://rivanna.hpc.virginia.edu/"
        },
        "interactive"            : {
            "job_manager_endpoint": "fork://localhost/",
            "filesystem_endpoint" : "file://localhost/"
        },
        "default_queue"          : "main",
        "resource_manager"       : "SLURM",
        "agent_scheduler"        : "CONTINUOUS",
        "agent_spawner"          : "POPEN",
        "launch_methods"         : {
            "order" : ["MPIRUN"],
            "MPIRUN": {}
        },
        "pre_bootstrap_0"        : [
            "module load gcc/11.2.0",
            "module load openmpi/4.1.4"
        ],
        "default_remote_workdir" : "/scratch/$USER",
        "python_dist"            : "default",
        "virtenv_dist"           : "default",
        "virtenv_mode"           : "create",
        "rp_version"             : "local"
    }
}
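For reference, a definition like this is addressed from a pilot description by its resource label. Below is a minimal, untested sketch of how I would expect it to be used (assuming RP picks the file up as resource_uva.json so that the label resolves to uva.rivanna; the queue, runtime, and core counts are only placeholders):

import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # resource label "uva.rivanna" assumes the config above is installed as
    # resource_uva.json; queue / runtime / cores are illustrative placeholders
    pd = rp.PilotDescription({'resource'     : 'uva.rivanna',
                              'access_schema': 'ssh',
                              'queue'        : 'standard',
                              'runtime'      : 30,        # minutes
                              'cores'        : 4})
    pilot = pmgr.submit_pilots(pd)
    tmgr.add_pilots(pilot)

    # one trivial task, as in 00_getting_started.py
    td = rp.TaskDescription({'executable': '/bin/date'})
    tmgr.submit_tasks([td])
    tmgr.wait_tasks()
finally:
    session.close(download=True)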
I offer the following observations:
File "/home/mei2n/.virtualenvs/py3_11/lib/python3.11/site-packages/radical/utils/signatures.py", line 175, in <module>
from inspect import getargspec, is-class
ImportError: cannot import name 'getargspec' from 'inspect' (/apps/software/standard/mpi/gcc/11.2.0/openmpi/4.1.4/python/3.11.1/lib/python3.11/inspect.py)
These can be resolved by using modules gcc/11.2.0 openmpi/3.1.6 python/3.8.8
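(For reference, the underlying incompatibility is that Python 3.11 removed inspect.getargspec. An illustrative forward-compatible import, shown only as a sketch and not the actual radical.utils change, would be:)

# getargspec was removed in Python 3.11; getfullargspec is a near drop-in
# replacement (it returns a wider result tuple)
try:
    from inspect import getargspec                        # Python <= 3.10
except ImportError:
    from inspect import getfullargspec as getargspec      # Python >= 3.11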
Examples like 00_getting_started.py use os.path.dirname(__file__) to find the config.json file in the same directory. This does not work if the example is executed as python 00_getting_started.py instead of ./00_getting_started.py, since __file__ may then be a bare relative filename and dirname() yields an empty string.
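(A hypothetical one-line fix would be to anchor the lookup to the script's absolute location, regardless of how it is invoked:)

import os

# os.path.dirname('00_getting_started.py') is '', so resolve the absolute path first
config_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'config.json')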
The slurm script generated includes #SBATCH --ntasks-per-node=128, which is not valid. This is resolved by adding "cores_per_node" to the resource definition. For the "standard" queue, this is 40; other queues have different characteristics. I think there might be a way to specify this per queue, but it is not documented.

@eirrgang Eric, thank you for this update.
> pre_bootstrap_0 loads gcc/11.2.0 and openmpi/4.1.4, which would be consistent with the requirements of the python/3.11.1 module, but the python/3.11.1 module is not explicitly mentioned.

Should we go for py3.11? If it is possible to set a lower version, that sounds like a safe approach.

> radical.utils does not seem to be compatible with python 3.11

This was fixed in radical.utils (RU PR #372).

> The slurm script generated includes #SBATCH --ntasks-per-node=128, which is not valid. This is resolved by adding "cores_per_node" to the resource definition. For the "standard" queue, this is 40; other queues have different characteristics. I think there might be a way to specify this per queue, but it is not documented.

Right, SAGA is updated accordingly (along with the latest fix for Bridges2). And speaking of the RP example, we'll fix it as well.

@AymenFJA, can you please create a corresponding PR, considering Eric's comments about the default queue?
> Should we go for py3.11? If it is possible to set a lower version, that sounds like a safe approach.
Personally, I need Python>=3.9 for the application code and for the script that launches the RP session. These don't have to use the same Python environment as the Pilot, but it is definitely convenient if they can.
If the ru and rs fixes will be in place before a resource_uva.json is bundled, Py 3.11 would seem like a reasonable default. If resource_uva.json is not expected to be bundled, then it seems sufficient to note here that the two alternative configurations are:
"pre_bootstrap_0" :[
"module load gcc/11.2.0",
"module load openmpi/4.1.4",
"module load python/3.11.1"
],
and,
"pre_bootstrap_0" :[
"module load gcc/11.2.0",
"module load openmpi/3.1.6",
"module load python/3.8.8"
],
@eirrgang I see, yeah, then 3.11 is the way to go for Rivanna.

> then it seems sufficient to note here that the two alternative configurations are

Agree.

Thanks @eirrgang for updating the ticket. Would it be possible to open a PR toward RP with the Rivanna config, please? (Assuming that you were able to test the config successfully on Rivanna.)
RP-Rivanna Interactive Job: Passed
(rct) -bash-4.2$ls
CHANGES.md LICENSE.md MANIFEST.in README.md TODO VERSION bin docker docs examples requirements-docs.txt requirements-tests.txt requirements.txt setup.py src tests
(rct) -bash-4.2$ijob -p dev
salloc: Pending job allocation 49268267
salloc: job 49268267 queued and waiting for resources
salloc: job 49268267 has been allocated resources
salloc: Granted job allocation 49268267
salloc: Waiting for resource configuration
salloc: Nodes udc-ba27-18 are ready for job
bash-4.2$source ~/ve/rct/bin/activate
(rct) bash-4.2$export RADICAL_LOG_LVL="DEBUG"
(rct) bash-4.2$export RADICAL_PROFILE="TRUE"
(rct) bash-4.2$export RADICAL_PILOT_DBURL=mongodb://aymen:vdpXXXXX@XXXXX:27017/radical3
(rct) bash-4.2$
(rct) bash-4.2$vi 00_getting_started.py
(rct) bash-4.2$python 00_getting_started.py
================================================================================
Getting Started (RP version 1.22.0)
================================================================================
new session: [rp.session.udc-ba27-18.vaf8uz.019468.0003] \
database : [mongodb://aymen:****@95.217.193.116:27017/radical3] ok
create pilot manager ok
create task manager ok
--------------------------------------------------------------------------------
submit pilots
submit 1 pilot(s)
pilot.0000 uva.rivanna 1 cores 0 gpus ok
--------------------------------------------------------------------------------
submit 10 tasks
create: ########################################################################
submit: ########################################################################
wait : ########################################################################
DONE : 10
ok
--------------------------------------------------------------------------------
finalize
closing session rp.session.udc-ba27-18.vaf8uz.019468.0003 \
close task manager ok
close pilot manager \
wait for 1 pilot(s)
0 timeout
ok
+ rp.session.udc-ba27-18.vaf8uz.019468.0003 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 81.3s ok
--------------------------------------------------------------------------------
RP-Rivanna local (login node): Passed
(rct) -bash-4.2$hostname
udc-ba34-36 (login node)
(rct) -bash-4.2$python 00_getting_started.py
================================================================================
Getting Started (RP version 1.22.0)
================================================================================
new session: [rp.session.udc-ba34-36.vaf8uz.019468.0005] \
database : [mongodb://aymen:****@95.217.193.116:27017/radical3] ok
create pilot manager ok
create task manager ok
--------------------------------------------------------------------------------
submit pilots
submit 1 pilot(s)
pilot.0000 uva.rivanna 1 cores 0 gpus ok
--------------------------------------------------------------------------------
submit 10 tasks
create: ########################################################################
submit: ########################################################################
wait : ########################################################################
DONE : 10
ok
--------------------------------------------------------------------------------
finalize
closing session rp.session.udc-ba34-36.vaf8uz.019468.0005 \
close task manager ok
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
+ rp.session.udc-ba34-36.vaf8uz.019468.0005 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 90.6s ok
--------------------------------------------------------------------------------
RP-Rivanna interactive (batch mode): Passed
(rct) -bash-4.2$tail -f rp_sbatch_0.err
================================================================================
Getting Started (RP version 1.22.0)
================================================================================
new session: [rp.session.udc-ba27-18.vaf8uz.019469.0001] \
database : [mongodb://aymen:****@95.217.193.116:27017/radical3] ok
create pilot manager ok
create task manager ok
--------------------------------------------------------------------------------
submit pilots
submit 1 pilot(s)
pilot.0000 uva.rivanna 1 cores 0 gpus ok
--------------------------------------------------------------------------------
submit 10 tasks
create: ########################################################################
submit: ########################################################################
wait : ########################################################################
DONE : 10
ok
--------------------------------------------------------------------------------
finalize
closing session rp.session.udc-ba27-18.vaf8uz.019469.0001 \
close task manager ok
close pilot manager \
wait for 1 pilot(s)
0 timeout
ok
+ rp.session.udc-ba27-18.vaf8uz.019469.0001 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 80.1s ok
--------------------------------------------------------------------------------
> RP-Rivanna Interactive Job: Passed
Can you share the slurm command line you used to get the interactive job?
Were there any non-trivial changes needed to https://github.com/radical-cybertools/radical.pilot/pull/2855?
I think we should make sure to test some MPI examples before declaring success on the interactive mode, if MPI in interactive mode is supposed to be supported.
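For example, something along the lines of this untested sketch should exercise the MPIRUN launch method from the config (assuming the 'ranks' attribute of rp.TaskDescription available in recent RP releases; older releases used 'cpu_processes' instead, and the executable is only a placeholder):

import radical.pilot as rp

# a single multi-rank task, which the agent should launch via mpirun
td = rp.TaskDescription({'executable': '/bin/hostname',
                         'ranks'     : 4})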
I added it here: https://github.com/radical-cybertools/radical.pilot/pull/2855#issuecomment-1518442948
Closing this; see #2855.
Were you ever able to confirm whether "interactive" worked with an MPI example?
@eirrgang sorry for the late response. I did try that, but the MPI example (requesting one standard node) was pending for ~20 hours and never got an allocation, and I did not try again. Regardless, whether the Rivanna resource configuration works and whether RP in general runs on Rivanna is a separate question from whether the MPI example works. Thus, I opened a ticket here to follow up on the MPI issue, and I will pick it up soon.
related to https://github.com/radical-cybertools/other_activities/issues/23
Rivanna: https://www.rc.virginia.edu/userinfo/rivanna/overview/