radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Support Rivanna machine #2835

Closed AymenFJA closed 1 year ago

AymenFJA commented 1 year ago

related to https://github.com/radical-cybertools/other_activities/issues/23

Rivanna: https://www.rc.virginia.edu/userinfo/rivanna/overview/

eirrgang commented 1 year ago

I am a Rivanna user and am interested in this issue.

If anyone has a resource definition that they have used, could you please share it with me as a starting point?

Note: I do not have access to the issue tracker linked in the other_activities project.

eirrgang commented 1 year ago

In an email thread, the following resource_uva.json file was shared:

{
    "rivanna":
    {
        "description"                 : "Heterogeneous community-model Linux cluster",
        "notes"                       : "Access from registered IP address",
        "schemas"                     : ["local", "ssh", "interactive"],
        "local"                       :
        {
            "job_manager_endpoint"    : "slurm://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint"     : "file://rivanna.hpc.virginia.edu/"
        },
        "ssh"                         :
        {
            "job_manager_endpoint"    : "slurm+ssh://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint"     : "sftp://rivanna.hpc.virginia.edu/"
        },
        "interactive"                 :
        {
            "job_manager_endpoint"    : "fork://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },
        "default_queue"               : "main",
        "resource_manager"            : "SLURM",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "launch_methods"              : {
                                         "order": ["MPIRUN"],
                                         "MPIRUN" : {}
                                        },
        "pre_bootstrap_0"             :[
                                        "module load gcc/11.2.0",
                                        "module load openmpi/4.1.4"
                                        ],
        "default_remote_workdir"      : "/scratch/$USER",
        "python_dist"                 : "default",
        "virtenv_dist"                : "default",
        "virtenv_mode"                : "create",
        "rp_version"                  : "local"
    }
}
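A quick consistency check on a definition like the one above can catch a schema that is listed under "schemas" but has no matching section. The sketch below is only an illustration: `check_schemas` and `REQUIRED_ENDPOINTS` are hypothetical names, not part of RADICAL-Pilot, and the assumption that every schema needs both endpoint keys is inferred from the shared file, not from RP documentation.

```python
import json

# Abbreviated copy of the "rivanna" entry shared above (schemas and endpoints only).
RESOURCE_JSON = """
{
    "rivanna": {
        "schemas": ["local", "ssh", "interactive"],
        "local": {
            "job_manager_endpoint": "slurm://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint": "file://rivanna.hpc.virginia.edu/"
        },
        "ssh": {
            "job_manager_endpoint": "slurm+ssh://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint": "sftp://rivanna.hpc.virginia.edu/"
        },
        "interactive": {
            "job_manager_endpoint": "fork://localhost/",
            "filesystem_endpoint": "file://localhost/"
        }
    }
}
"""

# Assumption: each schema section is expected to carry these two keys.
REQUIRED_ENDPOINTS = ("job_manager_endpoint", "filesystem_endpoint")


def check_schemas(resource):
    """Return a list of problems found in one resource entry (empty if none)."""
    problems = []
    for schema in resource.get("schemas", []):
        section = resource.get(schema)
        if not isinstance(section, dict):
            problems.append(f"schema '{schema}' has no section")
            continue
        for key in REQUIRED_ENDPOINTS:
            if key not in section:
                problems.append(f"schema '{schema}' lacks '{key}'")
    return problems


config = json.loads(RESOURCE_JSON)
for name, resource in config.items():
    print(name, check_schemas(resource) or "ok")
```

The check only looks at structure; it does not confirm that the endpoints are reachable or that the queue names exist on the cluster.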

I offer the following observations:

environment

These can be resolved by using modules gcc/11.2.0 openmpi/3.1.6 python/3.8.8

invocation

Examples like 00_getting_started.py use os.path.dirname(__file__) to try to find the config.json file in the same directory, which does not work if the example is executed with python 00_getting_started.py instead of ./00_getting_started.py.
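One way to make the lookup independent of how the script is invoked is to resolve the script path to an absolute path before taking its directory. This is a minimal sketch, not the fix that was applied to the RP examples; `config_path` is a hypothetical helper name.

```python
import os


def config_path(script_file, name="config.json"):
    """Return the path of `name` in the same directory as `script_file`."""
    # abspath() resolves a bare '00_getting_started.py' against the current
    # working directory, so dirname() never returns an empty string.
    here = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(here, name)


# For a relative script name, dirname() alone yields '' ...
print(repr(os.path.dirname("00_getting_started.py")))
# ... while the helper yields an absolute path under the current directory.
print(config_path("00_getting_started.py"))
```

Inside an example script the call would be `config_path(__file__)`.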

resource definition

mtitov commented 1 year ago

@eirrgang Eric, thank you for this update

pre_bootstrap_0 loads gcc/11.2.0 and openmpi/4.1.4, which would be consistent with the requirements of the python/3.11.1 module, but the python/3.11.1 module is not explicitly mentioned.

Should we go for py3.11? If it is possible to set a lower version, that sounds like a safe approach.

radical.utils does not seem to be compatible with python 3.11

This was fixed in radical.utils (RU PR #372)

The generated Slurm script includes #SBATCH --ntasks-per-node=128, which is not valid on Rivanna. This is resolved by adding "cores_per_node" to the resource definition. For the "standard" queue, this is 40. Other queues have different characteristics. I think there might be a way to specify this per queue, but it is not documented.

Right, SAGA is updated accordingly (along with the latest fix for Bridges2)

As for the RP example, we'll fix it as well.

@AymenFJA, can you please create a corresponding PR that takes into account Eric's comments about the default queue?
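The "cores_per_node" fix discussed above could look roughly like the following when applied to a resource entry. This is a sketch under assumptions: only the key name "cores_per_node" and the value 40 for the "standard" queue come from the thread; the `CORES_PER_NODE` table and the `with_cores_per_node` helper are hypothetical, and using "standard" as the default queue is assumed here for illustration.

```python
import json

# Hypothetical lookup table; only the "standard" value (40) is from the thread.
CORES_PER_NODE = {"standard": 40}


def with_cores_per_node(resource, queue):
    """Return a copy of `resource` with "cores_per_node" set for `queue`."""
    patched = dict(resource)  # shallow copy, the input entry is left untouched
    patched["cores_per_node"] = CORES_PER_NODE[queue]
    return patched


# Minimal stand-in for the full "rivanna" entry.
resource = {"default_queue": "standard", "resource_manager": "SLURM"}
patched = with_cores_per_node(resource, resource["default_queue"])
print(json.dumps(patched, indent=4))
```

Since other queues have different node sizes, a single value in the resource entry only matches the queue it was chosen for.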

eirrgang commented 1 year ago

Should we go for py3.11? If it is possible to set a lower version, that sounds like a safe approach.

Personally, I need Python>=3.9 for the application code and for the script that launches the RP session. These don't have to use the same Python environment as the Pilot, but it is definitely convenient if they can.

If the ru and rs fixes will be in place before a resource_uva.json is bundled, Py 3.11 would seem like a reasonable default. If resource_uva.json is not expected to be bundled, then it seems sufficient to note here that two alternative configurations are

        "pre_bootstrap_0"             :[
                                        "module load gcc/11.2.0",
                                        "module load openmpi/4.1.4",
                                        "module load python/3.11.1"
                                        ],

and,

        "pre_bootstrap_0"             :[
                                        "module load gcc/11.2.0",
                                        "module load openmpi/3.1.6",
                                        "module load python/3.8.8"
                                        ],
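
The two alternatives above differ only in the module versions, so they can be generated from one table. The sketch below is illustrative: `MODULE_SETS` and `pre_bootstrap_0` are hypothetical names, and the module versions are copied from the two configurations listed above.

```python
import json

# Module sets copied from the two alternative configurations above.
MODULE_SETS = {
    "3.11": ["gcc/11.2.0", "openmpi/4.1.4", "python/3.11.1"],
    "3.8": ["gcc/11.2.0", "openmpi/3.1.6", "python/3.8.8"],
}


def pre_bootstrap_0(python_series):
    """Build the "pre_bootstrap_0" list for a Python series ("3.11" or "3.8")."""
    return [f"module load {module}" for module in MODULE_SETS[python_series]]


print(json.dumps({"pre_bootstrap_0": pre_bootstrap_0("3.11")}, indent=4))
```

Keeping compiler, MPI, and Python versions in one entry makes it harder to mix an openmpi module with a python module it was not paired with in the thread.
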
mtitov commented 1 year ago

@eirrgang I see, yeah, then 3.11 is the way to go for Rivanna

then it seems sufficient to note here that two alternative configurations are

agree

AymenFJA commented 1 year ago

Thanks @eirrgang for updating the ticket. Would it be possible to open a PR toward RP with the Rivanna config, please? This assumes you were able to test the config successfully on Rivanna.

mturilli commented 1 year ago

See https://github.com/radical-cybertools/radical.pilot/pull/2855

AymenFJA commented 1 year ago

RP-Rivanna Interactive Job Passed

(rct) -bash-4.2$ls
CHANGES.md  LICENSE.md  MANIFEST.in  README.md  TODO  VERSION  bin  docker  docs  examples  requirements-docs.txt  requirements-tests.txt  requirements.txt  setup.py  src  tests
(rct) -bash-4.2$ijob -p dev
salloc: Pending job allocation 49268267
salloc: job 49268267 queued and waiting for resources
salloc: job 49268267 has been allocated resources
salloc: Granted job allocation 49268267
salloc: Waiting for resource configuration
salloc: Nodes udc-ba27-18 are ready for job
bash-4.2$source ~/ve/rct/bin/activate
(rct) bash-4.2$export RADICAL_LOG_LVL="DEBUG"
(rct) bash-4.2$export RADICAL_PROFILE="TRUE"
(rct) bash-4.2$export RADICAL_PILOT_DBURL=mongodb://aymen:vdpXXXXX@XXXXX:27017/radical3
(rct) bash-4.2$

(rct) bash-4.2$vi 00_getting_started.py 
(rct) bash-4.2$python 00_getting_started.py 
================================================================================
 Getting Started (RP version 1.22.0)                                            
================================================================================
new session: [rp.session.udc-ba27-18.vaf8uz.019468.0003]                       \
database   : [mongodb://aymen:****@95.217.193.116:27017/radical3]             ok
create pilot manager                                                          ok
create task manager                                                           ok
--------------------------------------------------------------------------------
submit pilots                                                                   
submit 1 pilot(s)
        pilot.0000   uva.rivanna               1 cores       0 gpus           ok
--------------------------------------------------------------------------------
submit 10 tasks                                                                 
create: ########################################################################
submit: ########################################################################
wait  : ########################################################################
        DONE      :    10
                                                                              ok
--------------------------------------------------------------------------------
finalize                                                                        
closing session rp.session.udc-ba27-18.vaf8uz.019468.0003                      \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                          timeout
                                                                              ok
+ rp.session.udc-ba27-18.vaf8uz.019468.0003 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 81.3s                                                       ok
--------------------------------------------------------------------------------
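
The run above depends on environment variables exported in the shell beforehand. A small pre-flight check can report a missing variable before a session is started. This is a sketch only: the variable names are the ones exported in the transcript, while treating RADICAL_PILOT_DBURL as required and the other two as optional is an assumption, and `missing_variables` is a hypothetical helper.

```python
import os

# Names taken from the transcript above. Assumption: only the database URL is
# required; the logging and profiling switches are treated as optional.
REQUIRED = ("RADICAL_PILOT_DBURL",)
OPTIONAL = ("RADICAL_LOG_LVL", "RADICAL_PROFILE")


def missing_variables(environ, names=REQUIRED):
    """Return the names from `names` that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


missing = missing_variables(os.environ)
if missing:
    print("missing required variables:", ", ".join(missing))
else:
    print("required variables are set")

for name in missing_variables(os.environ, OPTIONAL):
    print("optional variable not set:", name)
```

The check never prints the variable values, so the database credentials stay out of the output.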

RP-Rivanna local (login node) Passed:

(rct) -bash-4.2$hostname
udc-ba34-36 (login node)
(rct) -bash-4.2$python 00_getting_started.py 
================================================================================
 Getting Started (RP version 1.22.0)                                            
================================================================================
new session: [rp.session.udc-ba34-36.vaf8uz.019468.0005]                       \
database   : [mongodb://aymen:****@95.217.193.116:27017/radical3]             ok
create pilot manager                                                          ok
create task manager                                                           ok
--------------------------------------------------------------------------------
submit pilots                                                                   
submit 1 pilot(s)
        pilot.0000   uva.rivanna               1 cores       0 gpus           ok
--------------------------------------------------------------------------------
submit 10 tasks                                                                 
create: ########################################################################
submit: ########################################################################
wait  : ########################################################################
        DONE      :    10
                                                                              ok
--------------------------------------------------------------------------------
finalize                                                                        
closing session rp.session.udc-ba34-36.vaf8uz.019468.0005                      \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ rp.session.udc-ba34-36.vaf8uz.019468.0005 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 90.6s                                                       ok
--------------------------------------------------------------------------------

RP-Rivanna interactive (batch mode) Passed:

(rct) -bash-4.2$tail -f rp_sbatch_0.err 
================================================================================
 Getting Started (RP version 1.22.0)                                            
================================================================================
new session: [rp.session.udc-ba27-18.vaf8uz.019469.0001]                       \
database   : [mongodb://aymen:****@95.217.193.116:27017/radical3]             ok
create pilot manager                                                          ok
create task manager                                                           ok
--------------------------------------------------------------------------------
submit pilots                                                                   
submit 1 pilot(s)
        pilot.0000   uva.rivanna               1 cores       0 gpus           ok
--------------------------------------------------------------------------------
submit 10 tasks                                                                 
create: ########################################################################
submit: ########################################################################
wait  : ########################################################################
        DONE      :    10
                                                                              ok
--------------------------------------------------------------------------------
finalize                                                                        
closing session rp.session.udc-ba27-18.vaf8uz.019469.0001                      \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                          timeout
                                                                              ok
+ rp.session.udc-ba27-18.vaf8uz.019469.0001 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 80.1s                                                       ok
--------------------------------------------------------------------------------
eirrgang commented 1 year ago

RP-Rivanna Interactive Job Passed

Can you share the slurm command line you used to get the interactive job?

Were there any non-trivial changes needed to https://github.com/radical-cybertools/radical.pilot/pull/2855?

I think we should make sure to test some MPI examples before declaring success on the interactive mode, if MPI in interactive mode is supposed to be supported.

AymenFJA commented 1 year ago

RP-Rivanna Interactive Job Passed

Can you share the slurm command line you used to get the interactive job?

Were there any non-trivial changes needed to #2855?

I think we should make sure to test some MPI examples before declaring success on the interactive mode, if MPI in interactive mode is supposed to be supported.

I added it here: https://github.com/radical-cybertools/radical.pilot/pull/2855#issuecomment-1518442948

AymenFJA commented 1 year ago

closing this, https://github.com/radical-cybertools/radical.pilot/pull/2855

eirrgang commented 1 year ago

closing this, #2855

Were you ever able to confirm whether "interactive" worked with an MPI example?

AymenFJA commented 1 year ago

@eirrgang sorry for the late response. I did try that, but the MPI example (I was requesting one standard node) was pending for ~20 hours and never got an allocation, and I did not try again. Regardless, whether the Rivanna resource configuration is correct and RP runs on Rivanna in general is a separate question from whether the MPI example works. Thus, I opened a ticket here to follow up on the MPI issue, and I will pick it up soon.