Closed andre-merzky closed 4 months ago
Comment from Mark A. Grondona regarding using GPUs
The "simple" test scheduler bundled with flux-core is mainly used for testing and
does not support GPUs. You'll need to build and install flux-sched, the Fluxion
graph-based scheduler.
(*) I've sent a request to OLCF support to deploy flux-sched on Frontier
This is now implemented and needs testing.
OLCF response on flux-sched installation (solution is ready, but not available on Frontier yet)
Based on what the admin said, it can be installed on Frontier the next time they build new OS images.
So I would guess the next Frontier downtime at the soonest, though they haven't given me an exact date.
I'm waiting to hear if they have more specific information.
Updates from OLCF
It currently only works for the default set of modules i.e. you can't change the cray-mpich version
or the rocm version to a higher version than what is loaded by default. I don't know what is required
to make that work currently but will pass the information on to the software team.
Required env for Flux to run:
module load flux
module load rocm
module load craype-accel-amd-gfx90a
export MPICH_GPU_SUPPORT_ENABLED=1
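A minimal sketch of a pre-flight check for the environment listed above, in case it helps when debugging a job script. The module names and the `MPICH_GPU_SUPPORT_ENABLED` variable come from the list; the helper name `check_flux_env` is hypothetical, and reading the colon-separated `LOADEDMODULES` variable is an assumption about how the module system on Frontier exposes loaded modules.

```python
import os

# Module names and the variable are taken from the list above.
# Reading LOADEDMODULES is an assumption about the module system in use.
REQUIRED_MODULES = ("flux", "rocm", "craype-accel-amd-gfx90a")


def check_flux_env(env=None):
    """Return a list of problems found; an empty list means the check passed."""
    env = os.environ if env is None else env
    # LOADEDMODULES is expected to look like "flux/<ver>:rocm/<ver>:...".
    entries = env.get("LOADEDMODULES", "").split(":")
    loaded = {entry.split("/")[0] for entry in entries if entry}
    problems = [
        f"module not loaded: {name}"
        for name in REQUIRED_MODULES
        if name not in loaded
    ]
    if env.get("MPICH_GPU_SUPPORT_ENABLED") != "1":
        problems.append("MPICH_GPU_SUPPORT_ENABLED is not set to 1")
    return problems


if __name__ == "__main__":
    for problem in check_flux_env():
        print(problem)
```

The function takes the environment as an argument so it can be exercised with a plain dict instead of the live job environment.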
Our testing on Frontier is ongoing.
Using the following stack:
$ radical-stack
python : /autofs/nccs-svm1_home1/matitov/am/ve3/bin/python3
pythonpath : /opt/cray/pe/python/3.9.13.1
version : 3.9.13
virtualenv : /autofs/nccs-svm1_home1/matitov/am/ve3
radical.gtod : 1.60.0
radical.pilot : 1.61.0-v1.60.0-14-g427f117b4@feature/frontier_flux
radical.saga : 1.60.0
radical.utils : 1.60.0
and with the latest changes committed to feature/frontier_flux, the tasks do see all nodes:
[...]
* task.000066: DONE [0], frontier05035
* task.000067: DONE [0], frontier05036
* task.000068: DONE [0], frontier05035
* task.000069: DONE [0], frontier05036
* task.000070: DONE [0], frontier05035
* task.000071: DONE [0], frontier05035
[...]
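To check the placement over a longer run, the report lines can be tallied per node. This is only a sketch: the line format is inferred from the excerpt above, and `tasks_per_node` is a hypothetical helper, not part of radical.pilot.

```python
import re
from collections import Counter

# Matches lines such as "* task.000066: DONE [0], frontier05035".
# The format is inferred from the excerpt above.
TASK_LINE = re.compile(r"^\*\s+(task\.\d+):\s+(\w+)\s+\[(-?\d+)\],\s+(\S+)$")


def tasks_per_node(report):
    """Count DONE tasks per node name in a textual task report."""
    counts = Counter()
    for line in report.splitlines():
        match = TASK_LINE.match(line.strip())
        if match and match.group(2) == "DONE":
            counts[match.group(4)] += 1
    return counts


sample = """\
* task.000066: DONE [0], frontier05035
* task.000067: DONE [0], frontier05036
* task.000068: DONE [0], frontier05035
* task.000069: DONE [0], frontier05036
* task.000070: DONE [0], frontier05035
* task.000071: DONE [0], frontier05035
"""

if __name__ == "__main__":
    for node, count in sorted(tasks_per_node(sample).items()):
        print(node, count)
```

Lines that do not match the pattern (such as the `[...]` elisions) are skipped, so the full report can be passed in unfiltered.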
Note that I had to manually install pyyaml and cffi in the virtualenv to get the flux module to import.
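A quick way to see which of these dependencies is missing from a virtualenv before launching anything is sketched below. `missing_modules` is a hypothetical helper; note that pyyaml is imported under the name `yaml`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pyyaml installs the "yaml" module; cffi and flux use their own names.
    missing = missing_modules(["yaml", "cffi", "flux"])
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("all modules found")
```

This only checks that the modules can be located, not that importing them succeeds, so a broken installation could still pass.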
Note also that psij submission failed (node configuration not available) and thus I used SAGA. I did not dig into that psij problem yet.
implemented now
Frontier requires flux startup via srun.
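As a rough illustration of starting Flux via srun, the sketch below builds a command line that launches one Flux broker per node. This is an assumption based on the usual way Flux is started under Slurm, not the exact command used in feature/frontier_flux; `flux_srun_command` and `./workload.sh` are hypothetical names, and Frontier may need additional srun options (for example for cores or GPUs).

```python
import shlex


def flux_srun_command(num_nodes, script=None):
    """Build an srun command that starts one Flux broker per node.

    Assumption: one broker per node, requested with "-N <n> -n <n>".
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    cmd = ["srun", "-N", str(num_nodes), "-n", str(num_nodes), "flux", "start"]
    if script:
        # Optional initial program for the Flux instance to run.
        cmd.append(script)
    return cmd


if __name__ == "__main__":
    print(shlex.join(flux_srun_command(2, "./workload.sh")))
```

Returning the command as a list keeps it usable with `subprocess.run` without going through a shell.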