radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

flux startup fixes #3138

Closed andre-merzky closed 4 months ago

andre-merzky commented 7 months ago

Frontier requires flux startup via srun.
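As a rough illustration of what "startup via srun" could look like, the sketch below assembles an `srun ... flux start` command line with one Flux broker per node. The helper name `flux_srun_command` and the choice of flags (`-N`/`-n` set to the node count) are assumptions for illustration only; they are not taken from the radical.pilot implementation, and Frontier may need additional Slurm options.

```python
import shlex


def flux_srun_command(n_nodes, initial_program=None):
    """Build an srun command that starts one Flux broker per node.

    Hypothetical helper: the flag selection is an assumption, not the
    command radical.pilot actually issues.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be >= 1")
    # -N: number of nodes, -n: number of tasks (one broker per node)
    cmd = ["srun", "-N", str(n_nodes), "-n", str(n_nodes), "flux", "start"]
    if initial_program:
        # Anything after 'flux start' runs as the instance's initial program
        cmd += list(initial_program)
    return cmd


# Example: a two-node instance that lists its resources and exits
print(shlex.join(flux_srun_command(2, ["flux", "resource", "list"])))
```

The command is only built and printed here, not executed, so it can be inspected before it is used inside an actual allocation.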

mtitov commented 6 months ago

Comment from Mark A. Grondona regarding using GPUs

The "simple" test scheduler bundled with flux-core is mainly used for testing and
does not support GPUs. You'll need to build and install flux-sched - the Fluxion
graph-based scheduler.

(*) I've sent a request to OLCF support to deploy flux-sched on Frontier

andre-merzky commented 5 months ago

This is now implemented and needs testing.

mtitov commented 5 months ago

OLCF response on flux-sched installation (solution is ready, but not available on Frontier yet)

Based on what the admin said, it can be installed on Frontier the next time they build new OS images.
So I would guess the next Frontier downtime at the soonest, though they haven't given me an exact date. 
I'm waiting to hear if they have more specific information.

mtitov commented 5 months ago

Updates from OLCF

It currently only works for the default set of modules i.e. you can't change the cray-mpich version 
or the rocm version to a higher version than what is loaded by default. I don't know what is required 
to make that work currently but will pass the information on to the software team.

Required env for Flux to run:

module load flux
module load rocm
module load craype-accel-amd-gfx90a
export MPICH_GPU_SUPPORT_ENABLED=1
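
A minimal sketch of a pre-flight check for the environment listed above, which may help catch a missing module before Flux is started. It assumes the module system exports a colon-separated `LOADEDMODULES` variable with `name/version` entries (typical for Lmod and Environment Modules, but not confirmed for Frontier here); the function name `check_flux_env` is hypothetical.

```python
import os

# Requirements taken from the OLCF note above
REQUIRED_MODULES = ("flux", "rocm", "craype-accel-amd-gfx90a")
REQUIRED_VARS = {"MPICH_GPU_SUPPORT_ENABLED": "1"}


def check_flux_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    # Assumption: LOADEDMODULES looks like "flux/<ver>:rocm/<ver>:..."
    loaded = {
        entry.split("/")[0]
        for entry in env.get("LOADEDMODULES", "").split(":")
        if entry
    }
    problems = [
        f"module not loaded: {name}"
        for name in REQUIRED_MODULES
        if name not in loaded
    ]
    problems += [
        f"{name} should be set to {value}"
        for name, value in REQUIRED_VARS.items()
        if env.get(name) != value
    ]
    return problems


if __name__ == "__main__":
    for problem in check_flux_env():
        print(problem)
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try outside of a Frontier session.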

Our testing on Frontier is ongoing.

andre-merzky commented 4 months ago

Using the following stack:

$ radical-stack

  python               : /autofs/nccs-svm1_home1/matitov/am/ve3/bin/python3
  pythonpath           : /opt/cray/pe/python/3.9.13.1
  version              : 3.9.13
  virtualenv           : /autofs/nccs-svm1_home1/matitov/am/ve3

  radical.gtod         : 1.60.0
  radical.pilot        : 1.61.0-v1.60.0-14-g427f117b4@feature/frontier_flux
  radical.saga         : 1.60.0
  radical.utils        : 1.60.0

and with the latest changes committed to feature/frontier_flux, the tasks do see all nodes:

[...]
  * task.000066: DONE [0], frontier05035
  * task.000067: DONE [0], frontier05036
  * task.000068: DONE [0], frontier05035
  * task.000069: DONE [0], frontier05036
  * task.000070: DONE [0], frontier05035
  * task.000071: DONE [0], frontier05035
[...]
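
To confirm from such a report that tasks are actually spread over all nodes, the lines can be tallied per host. The sketch below assumes the line format shown in the excerpt above (`* task.<id>: <STATE> [<exit code>], <hostname>`); the regular expression and the function name `tasks_per_node` are illustrative and not part of radical.pilot.

```python
import re
from collections import Counter

# Matches lines such as "  * task.000066: DONE [0], frontier05035"
TASK_LINE = re.compile(r"\*\s+(task\.\d+):\s+(\w+)\s+\[(-?\d+)\],\s+(\S+)")


def tasks_per_node(report):
    """Count DONE tasks per hostname in a task report."""
    counts = Counter()
    for line in report.splitlines():
        match = TASK_LINE.search(line)
        if match and match.group(2) == "DONE":
            counts[match.group(4)] += 1
    return counts


sample = """\
  * task.000066: DONE [0], frontier05035
  * task.000067: DONE [0], frontier05036
  * task.000068: DONE [0], frontier05035
  * task.000069: DONE [0], frontier05036
  * task.000070: DONE [0], frontier05035
  * task.000071: DONE [0], frontier05035
"""

print(dict(tasks_per_node(sample)))
# prints {'frontier05035': 4, 'frontier05036': 2}
```

Comparing the number of distinct hostnames with the number of allocated nodes gives a quick sanity check on placement.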

Note that I had to manually install pyyaml and cffi in the virtualenv to get the flux module to import. Note also that psij submission failed (node configuration not available), so I used SAGA instead. I have not dug into that psij problem yet.

andre-merzky commented 4 months ago

implemented now