
Flux scaling tests on Frontier@OLCF #3122

Closed: mtitov closed this issue 7 months ago

mtitov commented 8 months ago

Connected to https://github.com/radical-cybertools/radical.pilot/issues/3114

andre-merzky commented 7 months ago

RP flux scaling showed the following behavior:

[plot: task states over time for session rp.session.login11.matitov.019765.0003]

So we see only a limited number of tasks running concurrently (this is on 30 nodes, so there are sufficient resources available to run all tasks concurrently).

andre-merzky commented 7 months ago

Running a similar workload across 10 nodes in plain flux shows similar results. We obtain an allocation and run the tasks with:

> salloc -A CHM155_003 -t 0:30:00 -p batch -N 10
> flux start
> for x in $(seq 10); do flux submit ./test.sh; done

we get:

[plot: flux]
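For reference, test.sh itself is not shown in the thread. A minimal hypothetical stand-in, assuming the task does nothing but sleep for the number of seconds passed as its first argument (the later runs pass 300):

#!/bin/sh
# hypothetical test.sh: occupy a core for the requested number of
# seconds (default 300) so task concurrency is easy to read off a plot
sleep "${1:-300}"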

andre-merzky commented 7 months ago

Oh, we need to start flux differently! A plain flux start only brings up a single-node Flux instance on the node where it runs; launching the brokers via srun lets Flux manage the full allocation:

> salloc -A CHM155_003 -t 0:30:00 -p batch -N 10 --ntasks-per-node=56
> srun --pty flux start
> for x in $(seq $((56 * 10 * 3)) ); do flux submit ./test.sh 300 > /dev/null; done

# some data mangling (a possible reconstruction is sketched below)

> radical-analytics-plot.py -s line -y 2,1 -a png -f flux -X task -Y 'time [s]' -L 'stop,start' -t 'flux: 10 nodes, 1K tasks' flux.dat

gives the expected concurrency of 560 tasks (56 tasks per node × 10 nodes; the loop above submits 56 × 10 × 3 = 1,680 tasks, i.e., three consecutive waves):

[plot: flux: 10 nodes, 1K tasks]
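The data mangling above is elided; a plausible reconstruction (assuming flux-core's flux jobs format fields t_run and t_cleanup hold per-job start and stop timestamps, that flag spellings match the installed flux-core version, and that flux.dat simply needs one start/stop pair per line, ordered by start time) would be:

# dump start (t_run) and stop (t_cleanup) timestamps for all jobs,
# without the header line, sorted by start time
> flux jobs -a -n -o "{t_run} {t_cleanup}" | sort -n > flux.dat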

PS: Note that Flux only manages to start about 5 tasks per second.
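(At ~5 starts per second, just filling the 560 concurrent slots takes about 560 / 5 ≈ 112 s, and pushing out all 1,680 tasks takes roughly 1680 / 5 ≈ 336 s of launcher time alone.)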

mturilli commented 7 months ago

Flux allows us to avoid the limit of ~100 concurrent tasks imposed by the SLURM configuration on Frontier. Nonetheless, the Flux scheduler is much slower than the RP scheduler. The next step is to run multiple Flux instances concurrently, i.e., to extend the Flux executor in RP.
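A rough sketch of that idea (hypothetical, not the actual RP implementation): partition the allocation and start one Flux instance per partition, so each instance only has to sustain a fraction of the ~5 tasks/s start rate. Reusing the 10-node setup and test.sh from above:

# two 5-node Flux instances side by side; each submits half of the
# 1,680 tasks, then waits for its own queue to drain
> salloc -A CHM155_003 -t 0:30:00 -p batch -N 10 --ntasks-per-node=56
> for i in 1 2; do srun -N 5 -n $((5 * 56)) flux start sh -c 'for x in $(seq 840); do flux submit ./test.sh 300 > /dev/null; done; flux queue drain' & done
> wait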