Closed: andre-merzky closed this PR 3 months ago.
@mtitov : this should be ready for review again - thanks.
Attention: Patch coverage is 23.01587% with 97 lines in your changes missing coverage. Please review.
Project coverage is 44.32%. Comparing base (f508e64) to head (96cf8ff).
@andre-merzky Resource config for Frontier with Flux: we use the SLURM resource manager and not FORK, right? https://github.com/radical-cybertools/radical.pilot/blob/48a8a284cae47d4dbf4c3cbc1ce6465ea7f2a374/src/radical/pilot/configs/resource_ornl.json#L142
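For illustration only, a Python-dict sketch of the kind of entry in question (the actual JSON in the linked file is authoritative; resource_manager is the usual RP config key, everything else is omitted):

# Hypothetical sketch mirroring a resource_ornl.json entry -- check the
# linked file for the authoritative values.
frontier_flux_cfg = {
    # SLURM, not FORK: the pilot agent is placed via a SLURM allocation.
    'resource_manager': 'SLURM',
}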
The stack below successfully runs two flux instances across two nodes each and round-robins the tasks across all four nodes:
$ radical-stack
python : /autofs/nccs-svm1_home1/matitov/am/ve3/bin/python3
pythonpath : /opt/cray/pe/python/3.9.13.1
version : 3.9.13
virtualenv : /autofs/nccs-svm1_home1/matitov/am/ve3
radical.gtod : 1.60.0
radical.pilot : 1.61.0-v1.60.0-38-gb7d1254ae@feature/flux_partitioning
radical.saga : 1.60.0
radical.utils : 1.61.0-v1.60.0-5-gf9194653@fix/flux_remote
Flux still doesn't see all resources, and there is still an error regarding the task environment.
flux resources [ 0 ]:
STATE NNODES NCORES NGPUS NODELIST
free 1 1 1 frontier10417
allocated 0 0 0
down 0 0 0
Srun command list
flux command: srun -n 1 -N 1 --ntasks-per-node 1 --export=ALL \
flux start bash -c 'echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf'
Task error message
$ cat task.000000.err
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10417.matitov.019873.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10417.matitov.019873.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10417.matitov.019873.0004/pilot.0000/task.000000//task.000000.exec.sh: line 49: module: command not found
pre_exec failed
woohoo:
(ve3) [matitov@login03.frontier pilot.0000]$ grep -C 3 allocated *log
flux.0155.log-1717146059.324 : flux.0155 : 124106 : 140731208460032 : INFO : flux resources [ 0 ]:
flux.0155.log- STATE NNODES NCORES NGPUS NODELIST
flux.0155.log- free 2 112 16 frontier[07800,10484]
flux.0155.log: allocated 0 0 0
flux.0155.log- down 0 0 0
flux.0155.log-
This is the launch command:
srun -n 2 -N 2 --ntasks-per-node 1 --cpus-per-task=112 --gpus-per-task=8 --export=ALL flux start bash -c 'echo "HOST:$(hostname) URI:$FLUX_URI" && sleep inf'
which is generated from this:
launcher = 'srun -n %s -N %d --ntasks-per-node 1 --cpus-per-task=%d --gpus-per-task=%d --export=ALL' \
% (nodes_per_partition, nodes_per_partition, threads_per_node, gpus_per_node)
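For concreteness, a runnable sketch showing how that template expands to the srun line above (the parameter values are assumed from the two-node run described earlier):

nodes_per_partition = 2     # two nodes per flux partition, as in the run above
threads_per_node    = 112   # cores per partition as reported by flux
gpus_per_node       = 8

launcher = 'srun -n %s -N %d --ntasks-per-node 1 --cpus-per-task=%d --gpus-per-task=%d --export=ALL' \
         % (nodes_per_partition, nodes_per_partition, threads_per_node, gpus_per_node)

print(launcher)
# srun -n 2 -N 2 --ntasks-per-node 1 --cpus-per-task=112 --gpus-per-task=8 --export=ALL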
It does execute the test example correctly, but task.000000.err contains the following two error messages:
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10272.matitov.019900.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10272.matitov.019900.0004//pilot.0000//gtod: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory
which affect task profiling:
$ cat task.000000.prof
0.0000000,exec_start,,MainThread,task.000000,AGENT_EXECUTING,
0.0000000,exec_pre,,MainThread,task.000000,AGENT_EXECUTING,
1719379110.4348950,rank_start,,MainThread,task.000000,AGENT_EXECUTING,
1719379140.4490670,rank_stop,,MainThread,task.000000,AGENT_EXECUTING,RP_EXEC_PID=61092:RP_RANK_PID=61173
1719379140.4596250,exec_post,,MainThread,task.000000,AGENT_EXECUTING,
1719379140.4695820,exec_stop,,MainThread,task.000000,AGENT_EXECUTING,
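The rows with 0.0000000 timestamps are the ones written while gtod failed to load (the libfabric error above). A minimal sketch, assuming the CSV layout shown, that flags such rows:

import csv

# Flag profile rows whose timestamp is zero, i.e. events recorded while
# the gtod helper could not be loaded.
with open('task.000000.prof') as fin:
    for row in csv.reader(fin):
        if row and float(row[0]) == 0.0:
            print('missing timestamp for event: %s' % row[1])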
An example of task.000000.exec.sh is below, within the "details" section.
P.S. If module operations are in the pre_exec, then activation of Lmod is necessary (. /sw/frontier/init/profile), which sets the following env variables; a sketch of such a pre_exec follows the diff below:
$ diff env.0.a.dump env.0.b.dump
7a8,9
> LMOD_MODULERCFILE=/sw/frontier/lmod/etc/rc.lua
> LMOD_SYSTEM_NAME=frontier
8a11
> MEMBERWORK=/lustre/orion/scratch/
12a16
> MPICH_OFI_NIC_POLICY=NUMA
15a20
> RFE_811452_DISABLE=1
35a41,42
> PROJWORK=/lustre/orion/proj-shared
> HWLOC_PCI_LOCALITY=/usr/share/hwloc/pci-locality-hpe-cray-ex235a
41a49
> FI_CXI_ATS=0
45a54,55
> LMOD_PACKAGE_PATH=/sw/frontier/lmod/libexec
> PATH=/sw/frontier/bin:/usr/local/bin:/usr/bin:/bin:.
52a63
> WORLDWORK=/lustre/orion/world-shared
(/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.frontier10272.matitov.019900.0004)
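Not from the PR itself, just a minimal sketch of how such a pre_exec could look in a task description (rp.TaskDescription and its pre_exec attribute are the regular RP API; the executable and module name are placeholders):

import radical.pilot as rp

td = rp.TaskDescription()
td.executable = '/bin/hostname'       # placeholder executable
td.pre_exec   = [
    '. /sw/frontier/init/profile',    # activate Lmod first ...
    'module load some_module',        # ... so `module` is defined ('some_module' is a placeholder)
]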
Ah, I should have linked this PR: https://github.com/radical-cybertools/radical.gtod/pull/20 to resolve the profiler issue.
It did resolve the issue.
@andre-merzky we don't have anything else left for this PR, right? (Though we do have a final step: setting a default number of partitions = 1.)
Right - I pushed the default setting now.
This adds round-robin load balancing to the flux partitioning.
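Not the actual implementation, just a minimal sketch of the round-robin idea (the partition handles and the picker function are illustrative):

import itertools

# Hypothetical handles for the per-partition Flux instances started by
# the launcher above.
partitions = ['flux.0000', 'flux.0001']
_next      = itertools.cycle(range(len(partitions)))

def pick_partition():
    # Each incoming task goes to the next partition in turn, balancing
    # the load across all Flux instances.
    return partitions[next(_next)]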