radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Feature/flux partitioning #3174

Closed - andre-merzky closed this 3 months ago

andre-merzky commented 5 months ago

This adds round-robin load balancing to the flux partitioning.
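
For context, a minimal sketch of the round-robin idea behind this change; the names (`assign_round_robin`, `tasks`, `partitions`) are illustrative and do not reflect the actual scheduler code in this PR:

```python
import itertools

def assign_round_robin(tasks, partitions):
    # Cycle through the partition indices and hand each task the next one
    # in turn, so the load spreads evenly across all partitions.
    cycle = itertools.cycle(range(len(partitions)))
    return {task: next(cycle) for task in tasks}

# four tasks over two partitions -> {'t0': 0, 't1': 1, 't2': 0, 't3': 1}
print(assign_round_robin(['t0', 't1', 't2', 't3'], ['part.0', 'part.1']))
```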

andre-merzky commented 4 months ago

@mtitov : this should be ready for review again - thanks.

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 23.01587% with 97 lines in your changes missing coverage. Please review.

Project coverage is 44.32%. Comparing base (f508e64) to head (96cf8ff).

| Files | Patch % | Lines |
|---|---|---|
| src/radical/pilot/agent/launch_method/flux.py | 10.71% | 50 Missing :warning: |
| src/radical/pilot/agent/executing/flux.py | 15.62% | 27 Missing :warning: |
| src/radical/pilot/task_manager.py | 0.00% | 14 Missing :warning: |
| src/radical/pilot/agent/agent_0.py | 62.50% | 3 Missing :warning: |
| src/radical/pilot/task.py | 71.42% | 2 Missing :warning: |
| src/radical/pilot/utils/session.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##            devel    #3174      +/-   ##
==========================================
- Coverage   44.45%   44.32%   -0.13%
==========================================
  Files          95       95
  Lines       10371    10428      +57
==========================================
+ Hits         4610     4622      +12
- Misses       5761     5806      +45
```

:umbrella: View full report in Codecov by Sentry.

mtitov commented 4 months ago

@andre-merzky Regarding the resource config for Frontier with Flux: we use the SLURM resource manager and not FORK, right? https://github.com/radical-cybertools/radical.pilot/blob/48a8a284cae47d4dbf4c3cbc1ce6465ea7f2a374/src/radical/pilot/configs/resource_ornl.json#L142

andre-merzky commented 4 months ago

The stack below successfully runs two flux instances across two nodes each and round-robins the tasks across all four nodes:

$ radical-stack

  python               : /autofs/nccs-svm1_home1/matitov/am/ve3/bin/python3
  pythonpath           : /opt/cray/pe/python/3.9.13.1
  version              : 3.9.13
  virtualenv           : /autofs/nccs-svm1_home1/matitov/am/ve3

  radical.gtod         : 1.60.0
  radical.pilot        : 1.61.0-v1.60.0-38-gb7d1254ae@feature/flux_partitioning
  radical.saga         : 1.60.0
  radical.utils        : 1.61.0-v1.60.0-5-gf9194653@fix/flux_remote

mtitov commented 4 months ago

Flux still does not see all resources, and there is still an error regarding the task environment:

flux resources [ 0 ]:
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1        1        1 frontier10417
 allocated      0        0        0 
      down      0        0        0 

The srun command used to launch Flux:

flux command: srun -n 1 -N 1 --ntasks-per-node 1 --export=ALL \
              flux start bash -c echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf

Task error message

$ cat task.000000.err 
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10417.matitov.019873.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10417.matitov.019873.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10417.matitov.019873.0004/pilot.0000/task.000000//task.000000.exec.sh: line 49: module: command not found
pre_exec failed

andre-merzky commented 4 months ago

woohoo:

(ve3) [matitov@login03.frontier pilot.0000]$ grep -C 3 allocated *log
flux.0155.log-1717146059.324 : flux.0155            : 124106 : 140731208460032 : INFO     : flux resources [ 0 ]:
flux.0155.log-     STATE NNODES   NCORES    NGPUS NODELIST
flux.0155.log-      free      2      112       16 frontier[07800,10484]
flux.0155.log: allocated      0        0        0
flux.0155.log-      down      0        0        0
flux.0155.log-

andre-merzky commented 4 months ago

This is the launch command:

srun -n 2 -N 2 --ntasks-per-node 1 --cpus-per-task=112 --gpus-per-task=8 --export=ALL flux start bash -c echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf

which is generated from this code:

launcher = 'srun -n %s -N %d --ntasks-per-node 1 --cpus-per-task=%d --gpus-per-task=%d --export=ALL' \
                           % (nodes_per_partition, nodes_per_partition, threads_per_node, gpus_per_node)
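
For reference, a small worked example of how that format string expands into the srun line shown above, using the values from this run (`nodes_per_partition=2`, `threads_per_node=112`, `gpus_per_node=8`); this only reproduces the string formatting, not the surrounding launch-method code:

```python
# Worked example of the launcher format string with the values from this run.
nodes_per_partition = 2
threads_per_node    = 112
gpus_per_node       = 8

launcher = 'srun -n %s -N %d --ntasks-per-node 1 --cpus-per-task=%d --gpus-per-task=%d --export=ALL' \
         % (nodes_per_partition, nodes_per_partition, threads_per_node, gpus_per_node)

print(launcher)
# srun -n 2 -N 2 --ntasks-per-node 1 --cpus-per-task=112 --gpus-per-task=8 --export=ALL
```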

mtitov commented 3 months ago

It does execute the test example correctly, but task.000000.err contains the following two error messages:

/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10272.matitov.019900.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10272.matitov.019900.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory

which affect task profiling

$ cat task.000000.prof 
0.0000000,exec_start,,MainThread,task.000000,AGENT_EXECUTING,
0.0000000,exec_pre,,MainThread,task.000000,AGENT_EXECUTING,
1719379110.4348950,rank_start,,MainThread,task.000000,AGENT_EXECUTING,
1719379140.4490670,rank_stop,,MainThread,task.000000,AGENT_EXECUTING,RP_EXEC_PID=61092:RP_RANK_PID=61173
1719379140.4596250,exec_post,,MainThread,task.000000,AGENT_EXECUTING,
1719379140.4695820,exec_stop,,MainThread,task.000000,AGENT_EXECUTING,
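
As a side note, the rank timestamps are still usable even with the zeroed events; a minimal sketch of pulling the rank duration out of the profile lines above (RADICAL-Pilot normally provides RADICAL-Analytics for this kind of analysis, so this is just an illustration of the file layout):

```python
import csv

# Read 'timestamp,event,...' rows from the task profile shown above and keep
# the last timestamp seen for each event name.
events = {}
with open('task.000000.prof') as fin:
    for row in csv.reader(fin):
        if len(row) < 2 or not row[0]:
            continue
        events[row[1]] = float(row[0])

# exec_start/exec_pre are 0.0 here because the pilot's gtod helper failed to
# load libfabric, but the rank events still give the task runtime:
print('rank duration: %.2f s' % (events['rank_stop'] - events['rank_start']))
# -> rank duration: 30.01 s
```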

An example task.000000.exec.sh is shown below:

```
$ cat task.000000.exec.sh
#!/usr/bin/bash

# ------------------------------------------------------------------------------
export RP_TASK_ID="task.000000"
export RP_TASK_NAME="task.000000"
export RP_PILOT_ID="pilot.0000"
export RP_SESSION_ID="rp.session.frontier10272.matitov.019900.0004"
export RP_RESOURCE="ornl.frontier_flux"
export RP_RESOURCE_SANDBOX="/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox"
export RP_SESSION_SANDBOX="$RP_RESOURCE_SANDBOX/$RP_SESSION_ID/"
export RP_PILOT_SANDBOX="$RP_SESSION_SANDBOX/pilot.0000/"
export RP_TASK_SANDBOX="$RP_PILOT_SANDBOX/task.000000"
export RP_REGISTRY_ADDRESS="tcp://10.128.160.148:10002"
export RP_CORES_PER_RANK=50
export RP_GPUS_PER_RANK=1
export RP_GTOD="$RP_PILOT_SANDBOX/gtod"
export RP_PROF="$RP_PILOT_SANDBOX/prof"
export RP_PROF_TGT="$RP_PILOT_SANDBOX/task.000000/task.000000.prof"

rp_error() {
    echo "$1 failed" 1>&2
    exit 1
}

# ------------------------------------------------------------------------------
# rank ID
export RP_RANKS=1
export RP_RANK=$FLUX_TASK_RANK

rp_sync_ranks() {
    sig=$1
    echo $RP_RANK >> $sig.sig
    while test $(cat $sig.sig | wc -l) -lt $RP_RANKS; do
        sleep 1
    done
}

# ------------------------------------------------------------------------------
$RP_PROF exec_start ""

# task env settings
export RP_PARTITION_ID="1"

# ------------------------------------------------------------------------------
# pre-exec commands
$RP_PROF exec_pre ""
env > env.$RP_RANK.a.dump || rp_error pre_exec
. /sw/frontier/init/profile || rp_error pre_exec
env > env.$RP_RANK.b.dump || rp_error pre_exec
module reset || rp_error pre_exec
module load PrgEnv-gnu || rp_error pre_exec
module load rocm/6.0.0 || rp_error pre_exec
export TF_FORCE_GPU_ALLOW_GROWTH=true || rp_error pre_exec
export MIOPEN_USER_DB_PATH=$RP_PILOT_SANDBOX/miopen-cache || rp_error pre_exec
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH} || rp_error pre_exec
mkdir -p ${MIOPEN_USER_DB_PATH} || rp_error pre_exec
source /ccs/proj/chm155/IMPECCABLE/miniconda/bin/activate st_mpi_base || rp_error pre_exec
cd /lustre/orion/scratch/matitov/chm155/flux/ST || rp_error pre_exec

# ------------------------------------------------------------------------------
# execute rank
$RP_PROF rank_start ""
python3 "smiles_regress_transformer_run_large.py" &
RP_EXEC_PID=$$
RP_RANK_PID=$!

wait $RP_RANK_PID
RP_RET=$?
$RP_PROF rank_stop "RP_EXEC_PID=$RP_EXEC_PID:RP_RANK_PID=$RP_RANK_PID"

# ------------------------------------------------------------------------------
# post-exec commands
$RP_PROF exec_post ""

# ------------------------------------------------------------------------------
$RP_PROF exec_stop ""
exit $RP_RET

# ------------------------------------------------------------------------------
```

P.S.: if module operations are used in the pre-exec, then Lmod has to be activated first via `. /sw/frontier/init/profile`, which sets the following environment variables (a client-side sketch of this follows after the diff):

$ diff env.0.a.dump env.0.b.dump
7a8,9
> LMOD_MODULERCFILE=/sw/frontier/lmod/etc/rc.lua
> LMOD_SYSTEM_NAME=frontier
8a11
> MEMBERWORK=/lustre/orion/scratch/
12a16
> MPICH_OFI_NIC_POLICY=NUMA
15a20
> RFE_811452_DISABLE=1
35a41,42
> PROJWORK=/lustre/orion/proj-shared
> HWLOC_PCI_LOCALITY=/usr/share/hwloc/pci-locality-hpe-cray-ex235a
41a49
> FI_CXI_ATS=0
45a54,55
> LMOD_PACKAGE_PATH=/sw/frontier/lmod/libexec
> PATH=/sw/frontier/bin:/usr/local/bin:/usr/bin:/bin:.
52a63
> WORLDWORK=/lustre/orion/world-shared

(/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10272.matitov.019900.0004)
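
For completeness, a sketch of what the corresponding client-side task description could look like, assuming the standard `radical.pilot.TaskDescription` with its `pre_exec` list; the executable and module commands are taken from the exec script above, the rest is illustrative only:

```python
import radical.pilot as rp

td = rp.TaskDescription()
td.executable = 'python3'
td.arguments  = ['smiles_regress_transformer_run_large.py']
td.pre_exec   = [
    '. /sw/frontier/init/profile',   # activate Lmod before any `module` call
    'module reset',
    'module load PrgEnv-gnu',
    'module load rocm/6.0.0',
]
```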

andre-merzky commented 3 months ago

Ah, I should have linked this PR: radical-cybertools/radical.gtod/pull/20 to resolve the profiler issue.

mtitov commented 3 months ago

> Ah, I should have linked this PR: https://github.com/radical-cybertools/radical.gtod/pull/20 to resolve the profiler issue.

It did resolve the issue.

mtitov commented 3 months ago

@andre-merzky we don't have anything else left for this PR, right? (though we do have one final step: setting the default number of partitions to 1)

andre-merzky commented 3 months ago

> @andre-merzky we don't have anything else left for this PR, right? (though we do have one final step: setting the default number of partitions to 1)

Right - I pushed the default setting now.
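
For illustration only, a sketch of the fall-back pattern described here; the key name `n_partitions` is an assumption and may differ from the actual RADICAL-Pilot configuration:

```python
# Hypothetical sketch of 'default number of partitions = 1'; the key name
# 'n_partitions' is assumed and not necessarily the real configuration key.
def get_n_partitions(cfg: dict) -> int:
    return int(cfg.get('n_partitions', 1))

assert get_n_partitions({}) == 1                   # default: one partition
assert get_n_partitions({'n_partitions': 4}) == 4  # explicit setting wins
```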