radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Add a resource definition for rivanna at UVa. #2855

Closed eirrgang closed 1 year ago

eirrgang commented 1 year ago

Ref #2835

eirrgang commented 1 year ago

Tested with

$ radical-stack

  python               : /home/mei2n/.virtualenvs/py3_11/bin/python3
  pythonpath           : /apps/software/standard/mpi/gcc/11.2.0/openmpi/4.1.4/python/3.11.1/easybuild/python
  version              : 3.11.1
  virtualenv           : /home/mei2n/.virtualenvs/py3_11

  radical.gtod         : 1.20.1
  radical.pilot        : 1.22.0-v1.4.0-4449-gf130471@2835-rivanna
  radical.saga         : 1.22.0-v1.21.0-5-gb73c644@devel
  radical.utils        : 1.22.0-v1.21.0-6-g38abe69@devel
codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 40.95%. Comparing base (fe20ee9) to head (c05eb41). Report is 3233 commits behind head on devel.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##            devel    #2855    +/-   ##
=========================================
  Coverage   40.95%   40.95%
=========================================
  Files          94       94
  Lines       10275    10275
=========================================
  Hits         4208     4208
  Misses       6067     6067
```


eirrgang commented 1 year ago

The `--ntasks-per-node 40` that gets generated by the new radical.saga may be causing trouble. It might be appropriate to update that logic with something like min(cores, 40). Test jobs indicate that the `--ntasks-per-node` is superseding the `--ntasks` part of the request with salloc (but seemingly not with srun).
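
For illustration only, here is a minimal sketch of the suggested capping (hypothetical function and parameter names; the actual radical.saga SLURM adaptor may structure this differently):

```python
# Hypothetical sketch of the suggested fix: never let the generated
# --ntasks-per-node exceed the number of cores actually requested.
def ntasks_per_node(requested_cores: int, cores_per_node: int = 40) -> int:
    return min(requested_cores, cores_per_node)

# An 8-core pilot on 40-core Rivanna nodes:
assert ntasks_per_node(8) == 8      # instead of the current 40
assert ntasks_per_node(120) == 40   # multi-node requests still fill a node
```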

eirrgang commented 1 year ago

> The `--ntasks-per-node 40` that gets generated by the new radical.saga may be causing trouble. It might be appropriate to update that logic with something like min(cores, 40). Test jobs indicate that the `--ntasks-per-node` is superseding the `--ntasks` part of the request with salloc (but seemingly not with srun).

I don't think this observation is relevant. Taking a closer look at the queued job:

$ scontrol show job 47818071
JobId=47818071 JobName=pilot.0000
   UserId=mei2n(860833) GroupId=users(100) MCS_label=N/A
   Priority=798174 Nice=0 Account=kas_dev QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:15:00 TimeMin=N/A
   SubmitTime=2023-03-20T08:00:37 EligibleTime=2023-03-20T08:00:37
   AccrueTime=2023-03-20T08:00:37
   StartTime=2023-03-21T22:03:11 EndTime=2023-03-21T22:18:11 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-20T08:52:08 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=udc-ba36-36:21622
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=72000M,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=9000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000/tmp_abk2cxmu.slurm
   WorkDir=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000/
   StdErr=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000//bootstrap_0.err
   StdIn=/dev/null
   StdOut=/scratch/mei2n/radical.pilot.sandbox/rp.session.pmks03.eirrgang.019436.0004/pilot.0000//bootstrap_0.out
   Power=

It isn't clear to me which part of the resource request is keeping the job stuck in the queue.

$ squeue -j 47818071 --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
          47818071  standard pilot.00    mei2n PD 2023-03-21T22:03:11      1 (null)               (Priority)
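
Purely as an illustration (not part of this PR, and assuming scontrol's usual key=value output format), a small helper could flag the NumCPUs / NtasksPerN mismatch visible above:

```python
# Rough diagnostic sketch: flag jobs whose per-node task count exceeds the
# number of CPUs actually requested (NtasksPerN=40 vs NumCPUs=8 above),
# which may or may not explain the long queue wait.
import re
import subprocess

def check_job(job_id: str) -> None:
    out = subprocess.run(['scontrol', 'show', 'job', job_id],
                         capture_output=True, text=True, check=True).stdout
    num_cpus  = int(re.search(r'NumCPUs=(\d+)', out).group(1))
    ntasks_pn = int(re.search(r'NtasksPerN:B:S:C=(\d+)', out).group(1))
    if ntasks_pn > num_cpus:
        print(f'job {job_id}: ntasks-per-node={ntasks_pn} > NumCPUs={num_cpus}; '
              f'the scheduler needs a node with {ntasks_pn} free CPUs')

check_job('47818071')
```
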
mtitov commented 1 year ago

@eirrgang Eric, just to conclude here: will interactive mode be added? After that, the PR should be ready for merging.

eirrgang commented 1 year ago

> @eirrgang Eric, just to conclude here: will interactive mode be added? After that, the PR should be ready for merging.

I was going to let @AymenFJA take a look at the archive from a test run (https://github.com/radical-cybertools/radical.pilot/pull/2855/files#r1144929336). I should also re-test with saga 1.22.

I'll go ahead and try that right now...

We can also add "interactive" in a follow-up when we're ready.

AymenFJA commented 1 year ago

> > @eirrgang Eric, just to conclude here: will interactive mode be added? After that, the PR should be ready for merging.
>
> I was going to let @AymenFJA take a look at the archive from a test run (https://github.com/radical-cybertools/radical.pilot/pull/2855/files#r1144929336). I should also re-test with saga 1.22.
>
> I'll go ahead and try that right now...
>
> We can also add "interactive" in a follow-up when we're ready.

@eirrgang checking your session files, I could not find any error on any level. I am assuming the MPI problem mentioned above on the task level is solved, or am I missing something?

eirrgang commented 1 year ago

@AymenFJA wrote:

> @eirrgang checking your session files, I could not find any error on any level. I am assuming the MPI problem mentioned above on the task level is solved, or am I missing something?

Errors persist in "interactive" mode:

$ ./09_mpi_tasks.py uva.rivanna

================================================================================
 Getting Started (RP version 1.22.0)
================================================================================

new session: [rp.session.udc-aw29-23a.mei2n.019438.0005]                       \
database   : [mongodb://eirrgang:****@95.217.193.116/scalems]                 ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   uva.rivanna               8 cores       0 gpus           ok

--------------------------------------------------------------------------------
submit tasks

create task manager                                                           ok
create 2 task description(s)
        ..                                                                    ok
submit: ########################################################################

--------------------------------------------------------------------------------
gather results

wait  : ########################################################################
    DONE      :     2
                                                                              ok

  * task.000000: DONE, exit:   0, ranks: udc-aw29-23a 0:0/2 @ 0/40 : 0/1
udc-aw29-23a 0:0/2 @ 0/40 : 0/1
udc-aw29-23a 0:0/2 @ 0/40 : 0/1
udc-aw29-23a 0:1/2 @ 0/40 : 0/1
udc-aw29-23a 0:1/2 @ 0/40 : 0/1
udc-aw29-23a 0:1/2 @ 0/40 : 0/1

caught Exception: missing rank 1:0/2 (['0:0/2', '0:0/2', '0:0/2', '0:1/2', '0:1/2', '0:1/2'])

--------------------------------------------------------------------------------
finalize

closing session rp.session.udc-aw29-23a.mei2n.019438.0005                      \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                          timeout
                                                                              ok
session lifetime: 90.8s                                                       ok
Traceback (most recent call last):
  File "/sfs/qumulo/qhome/mei2n/projects/radical.pilot/examples/./09_mpi_tasks.py", line 127, in <module>
    assert rank in ranks, 'missing rank %s (%s)' % (rank, ranks)
           ^^^^^^^^^^^^^
AssertionError: missing rank 1:0/2 (['0:0/2', '0:0/2', '0:0/2', '0:1/2', '0:1/2', '0:1/2'])

This is the same error that was encountered in the job associated with the previously linked archive. The tasks appear to succeed, but something about this invocation seems to produce different rank values than expected.

eirrgang commented 1 year ago

> If this works as expected for @eirrgang and @AymenFJA, I would merge it.

The local mode worked fine. The interactive mode failed as described. The ssh mode was not fully tested, since something about the generated job request kept it waiting in the queue for a very long time.

AymenFJA commented 1 year ago

> > If this works as expected for @eirrgang and @AymenFJA, I would merge it.
>
> The local mode worked fine. The interactive mode failed as described. The ssh mode was not fully tested, since something about the generated job request kept it waiting in the queue for a very long time.

Thanks, @eirrgang. Currently, I am waiting for my Rivanna request to be approved so I can get access to it. Once I have access, I will test the interactive and ssh modes.

eirrgang commented 1 year ago

Sorry... I think I accidentally deleted the source branch before this was resolved. Is this still the config we're going for?

AymenFJA commented 1 year ago

@eirrgang I was about to open a PR with the config file. But since you reopened this PR, would it be possible to update the config with this one (this is the one that I used in my test earlier):

{
    "rivanna":
    {
        "description"                 : "Heterogeneous community-model Linux cluster",
        "notes"                       : "Access from registered UVA IP address. See https://www.rc.virginia.edu/userinfo/rivanna/login/",
        "schemas"                     : ["local", "ssh", "interactive"],
        "local"                       :
        {
            "job_manager_endpoint"    : "slurm://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint"     : "file://rivanna.hpc.virginia.edu/"
        },
        "ssh"                         :
        {
            "job_manager_endpoint"    : "slurm+ssh://rivanna.hpc.virginia.edu/",
            "filesystem_endpoint"     : "sftp://rivanna.hpc.virginia.edu/"
        },
         "interactive"                 :
        {
            "job_manager_endpoint"    : "fork://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },
        "default_queue"               : "standard",
        "resource_manager"            : "SLURM",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "launch_methods"              : {
                                         "order": ["MPIRUN"],
                                         "MPIRUN" : {}
                                        },
        "pre_bootstrap_0"             : [
                                        "module load gcc/11.2.0",
                                        "module load openmpi/4.1.4",
                                        "module load python/3.11.1"
                                        ],
        "default_remote_workdir"      : "/scratch/$USER",
        "python_dist"                 : "default",
        "virtenv_dist"                : "default",
        "virtenv_mode"                : "create",
        "rp_version"                  : "local"
    }
}

Almost the same but with an interactive entry.
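
For reference, here is a rough sketch of how a client would select this resource entry (assuming the standard RP 1.x API; the values are illustrative):

```python
import radical.pilot as rp

# Illustrative sketch: the "schemas" list in the config maps to the pilot
# description's access_schema; "default_queue" is used unless a queue is
# given explicitly.
session = rp.Session()
pmgr    = rp.PilotManager(session=session)
pd      = rp.PilotDescription({'resource'     : 'uva.rivanna',
                               'access_schema': 'ssh',   # or 'local' / 'interactive'
                               'runtime'      : 15,      # minutes
                               'cores'        : 8})
pilot   = pmgr.submit_pilots(pd)
session.close()
```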

eirrgang commented 1 year ago

> @eirrgang I was about to open a PR with the config file. But since you reopened this PR, would it be possible to update the config with this one (this is the one that I used in my test earlier)

Are there updates in devel or elsewhere that should be merged to resolve the problems with 09_mpi_tasks.py reported above?

AymenFJA commented 1 year ago

> > @eirrgang I was about to open a PR with the config file. But since you reopened this PR, would it be possible to update the config with this one (this is the one that I used in my test earlier)
>
> Are there updates in devel or elsewhere that should be merged to resolve the problems with 09_mpi_tasks.py reported above?

No, I am testing the MPI to see if your issue above is reproducible. I will update the PR.

eirrgang commented 1 year ago

> No, I am testing the MPI to see if your issue above is reproducible. I will update the PR.

I am not well acquainted with the RP "interactive" sessions. I may have chosen job parameters poorly, or something. Please note the slurm commands you use, if you are successful.

AymenFJA commented 1 year ago

> > No, I am testing the MPI to see if your issue above is reproducible. I will update the PR.
>
> I am not well acquainted with the RP "interactive" sessions. I may have chosen job parameters poorly, or something. Please note the slurm commands you use, if you are successful.

To use the interactive mode you have two ways, but the same pilot description applies to both:

        pd_init = {'resource'      : 'uva.rivanna',
                   'runtime'       : 30,  # pilot runtime (min)
                   'exit_on_error' : True,
                   'access_schema' : 'interactive',
                   'cores'         : 1,
                   'gpus'          : 0
                  }
  1. Ask for an interactive node (ijob -p dev -t 00:30:00) and run your RP example with the pilot description above (python 00_getting_started.py).
  2. Or submit 00_getting_started.py as an sbatch job using the same pilot description above: just create an rp_sbatch.sh file (or similar) that calls 00_getting_started.py. Below is the one that I used:
    (rct) -bash-4.2$ cat rp_sbatch.sh 
    #!/bin/sh
    #SBATCH --ntasks=1
    #SBATCH --ntasks-per-node=4
    #SBATCH -J "rp_sbatch"
    #SBATCH --output "rp_sbatch_0.out"
    #SBATCH --error "rp_sbatch_0.err"
    #SBATCH --account "YOUR PROJECT ID"
    #SBATCH --partition "dev"
    #SBATCH --time 00:30:00
    export RADICAL_LOG_LVL="DEBUG"
    export RADICAL_PROFILE="TRUE"
    export RADICAL_PILOT_DBURL=mongodb://aymen:XXXXXX@95.217.193.116:27017/radical3
    source $HOME/ve/rct/bin/activate
    python 00_getting_started.py 
eirrgang commented 1 year ago
> 1. Ask for an interactive node (ijob -p dev -t 00:30:00) and run your RP example with the pilot description above (python 00_getting_started.py).

It is specifically the MPI case that concerns me, assuming interactive mode is expected to support MPI / multiprocess workloads.
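
For context, a multi-rank task description looks roughly like the sketch below (hypothetical executable; the 'ranks' field is assumed to be the current name for the per-task MPI process count, and such tasks would be launched via mpirun per the config above):

```python
import radical.pilot as rp

# Rough sketch of the kind of multi-rank task that hits the failure above
# (hypothetical executable; 'ranks' assumed to request 2 MPI ranks).
td = rp.TaskDescription()
td.executable = 'hostname'   # stand-in for an MPI-aware executable
td.ranks      = 2            # two MPI ranks per task
```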

andre-merzky commented 1 year ago

@eirrgang: we are going to merge this before the MPI is tested again (queue wait times are long right now). The MPI functionality is somewhat orthogonal to the configuration in this PR, and we want this PR to go into the release we are pushing out these days.