Closed by mtitov 7 months ago
RP flux scaling showed the following behavior:
We see a limited number of tasks running concurrently (this is on 30 nodes, so there are sufficient resources available to run all tasks concurrently).
Running a similar workload across 10 nodes in plain Flux shows similar results. We obtain an allocation and run the tasks with:
> salloc -A CHM155_003 -t 0:30:00 -p batch -N 10
> flux start
> for x in $(seq 10); do flux submit ./test.sh; done
we get:
Oh, we need to start flux differently!
> salloc -A CHM155_003 -t 0:30:00 -p batch -N 10 --ntasks-per-node=56
> srun --pty flux start
> for x in $(seq $((56 * 10 * 3)) ); do flux submit ./test.sh 300 > /dev/null; done
# some data mangling
> radical-analytics-plot.py -s line -y 2,1 -a png -f flux -X task -Y 'time [s]' -L 'stop,start' -t 'flux: 10 nodes, 1K tasks' flux.dat
gives the expected concurrency of 560 tasks:
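The concurrency curve in such a plot can be recomputed directly from the per-task start/stop timestamps with a simple sweep line. A minimal sketch, assuming the data (as in `flux.dat` above) reduces to one `(start, stop)` pair per task; the interval values below are synthetic, not measured:

```python
def max_concurrency(intervals):
    """Sweep-line count of the maximum number of overlapping [start, stop) intervals."""
    events = []
    for start, stop in intervals:
        events.append((start, 1))   # task starts -> one more running
        events.append((stop, -1))   # task stops  -> one fewer running
    events.sort()                   # ties: (-1) sorts before (+1), so stop frees a slot first
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak

# Synthetic example: 560 slots, 3 back-to-back waves of 300 s tasks.
waves = [(w * 300.0, w * 300.0 + 300.0) for w in range(3) for _ in range(560)]
print(max_concurrency(waves))
```

This prints `560`, matching the expected concurrency for 56 cores x 10 nodes.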
PS: Note that Flux only manages to start about 5 tasks per second.
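At that rate, ramp-up alone is significant relative to the 300 s task runtime. A back-of-the-envelope check, assuming the ~5 tasks/s figure above:

```python
start_rate   = 5         # tasks/s, observed Flux start rate (from above)
slots        = 56 * 10   # cores across 10 nodes
task_runtime = 300       # s, runtime passed to test.sh

ramp_up = slots / start_rate  # time to fill all 560 slots once
print(f"ramp-up: {ramp_up:.0f} s ({ramp_up / task_runtime:.0%} of one task runtime)")
```

So filling the machine takes roughly 112 s, i.e. more than a third of each task's runtime is spent before the last task of a wave even starts.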
Flux allows us to avoid the maximum number of concurrent tasks (~100) imposed by the SLURM configuration on Frontier. Nonetheless, the Flux scheduler is much slower than the RP scheduler. The next step is to run multiple Flux instances concurrently, i.e., to extend the Flux executor in RP.
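A minimal sketch of what that extension could look like: round-robin dispatch over several Flux instances to multiply the effective start rate. Note that `FluxInstance`, its `submit` method, and the URIs below are hypothetical placeholders, not RP's actual executor interface:

```python
import itertools

class FluxInstance:
    """Hypothetical stand-in for a handle to one running Flux broker."""
    def __init__(self, uri):
        self.uri = uri
        self.submitted = []

    def submit(self, task):
        # A real implementation would submit the task to the broker at self.uri.
        self.submitted.append(task)

class MultiFluxExecutor:
    """Round-robin task dispatch across multiple Flux instances."""
    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def submit(self, task):
        next(self._cycle).submit(task)

# Four hypothetical brokers, 1000 tasks spread evenly across them.
instances = [FluxInstance(f"local:///run/flux-{i}") for i in range(4)]
executor  = MultiFluxExecutor(instances)
for t in range(1000):
    executor.submit(f"task.{t:04d}")
print([len(i.submitted) for i in instances])  # → [250, 250, 250, 250]
```

With n instances each starting ~5 tasks/s, the aggregate start rate scales to roughly 5n tasks/s, which is the motivation for the executor extension.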
Connected to https://github.com/radical-cybertools/radical.pilot/issues/3114