It is interesting that I did not see this in a debug test case that I ran before. There, however, I had reduced the necessary resources to only 4 nodes: my `ev_pre` still needs just 1 thread per worker, and the `ev_main` needs 4 GPUs per worker. In that case, the first launch command reads:

```
srun -w <specific node name> --ntasks 1 --nodes 1 --ntasks-per-node 1 --gpus-per-task 4 --exact /global/cfs/cdirs/m4546/mgarten/sw/perlmutter/gpu/venvs/warpx-gpu/bin/python prepare_simulation.py
```

It is a bit unclear to me, though, why `gpus-per-task` is suddenly 4 even though the signature of the evaluators is the same. But with this test run I did not encounter the issue I reported above.
@n01r
It looks to me like the `sim_function` in Optimas is not set up to handle this case properly when there is more than one node (no `num_nodes` is passed to `executor.submit`, so it will use all available nodes). It might work to set `n_procs=1` for that evaluator.
That said, I don't know why your run line is saying 64 nodes.
On second thought, it might not need `num_nodes` added: if `n_procs=1` is passed to libEnsemble, it should work.
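On the Optimas side, that suggestion would look roughly like this (a minimal sketch, assuming the `TemplateEvaluator` keyword arguments; the template name is a placeholder):

```python
# Sketch: request a single process for the preparatory evaluator so the
# chain does not hand it one task per allocated node.
# "prepare_simulation.py" is a placeholder for the actual template.
from optimas.evaluators import TemplateEvaluator

ev_pre = TemplateEvaluator(
    sim_template="prepare_simulation.py",
    n_procs=1,  # one task only for the preparation step
)
```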
If not, can you send me your script? I'll try to reproduce.
Thanks, @shuds13, I'll report back once I've tried that. :)
@n01r I found a bug in libEnsemble where all worker-assigned GPUs get used when `num_gpus=0`. I've fixed that in this branch: https://github.com/Libensemble/libensemble/pull/1398.
I'll check the situation on multiple nodes.
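For context, the problem shows up in submissions of roughly this shape (an illustrative sketch with a placeholder app name and paths, using libEnsemble's `MPIExecutor`, not the actual Optimas internals):

```python
# Sketch: a task that explicitly requests zero GPUs. Before the fix,
# such a task could still be handed all GPUs assigned to the worker.
from libensemble.executors import MPIExecutor

exctr = MPIExecutor()
exctr.register_app(
    full_path="/path/to/python",  # placeholder interpreter path
    app_name="prepare",
)

# Inside the simulation function running on a worker:
task = exctr.submit(
    app_name="prepare",
    app_args="prepare_simulation.py",
    num_procs=1,
    num_gpus=0,  # no GPUs requested for the preparatory step
)
task.wait()
```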
Oh awesome, thank you! I put in a job with `n_procs=1`, but I'm attending an event today so I couldn't do more focused tests yet.
It looks like it's still not right over multiple nodes, so I will look into that.
Think I've fixed the multi-node case on that branch.
I used a smaller case with the same characteristics (the second evaluator is multi-node, the first evaluator is single-threaded and requires no GPU) and ran it on the Perlmutter debug queue. The `srun` commands look correct, and it now seems fixed to me. Thanks a lot, @shuds13! :)
Hi, I faced an issue with the `ChainEvaluator` and its resource allocation. I have two `TemplateEvaluator`s in my chain. One is a preparatory step, `ev_pre`, that is supposed to create an HDF5 input file for the main step. The main step, `ev_main`, then uses 32 GPUs to run a WarpX simulation. I allocated 128 nodes on Perlmutter to get 16 workers with 32 GPUs (meaning 8 nodes) per worker.

As specified in the documentation of `TemplateEvaluator`, when you don't supply `n_gpus` or `n_procs` for a task, it assumes the number is 1. With a `ChainEvaluator`, it will allocate the maximum resources that all tasks in the chain need.

Now, the problem is that my first task really just requires the 16 workers to each run a single thread of Python. However, that task got launched with:

```
srun -w <list of 32 nodes> --ntasks 64 --nodes 64 --ntasks-per-node 1 --gpus-per-task 1 --exact /global/cfs/cdirs/m4546/mgarten/sw/perlmutter/gpu/venvs/warpx-gpu/bin/python prepare_simulation.py
```

This is accompanied by:

Note that `ntasks-per-node` and `gpus-per-task` are equal to 1, as I require. But I also do not need 64 tasks or 64 nodes. I need 16 tasks, like the number of my `sim_workers`, so for the first (shorter) step I do not need all of the resources; I only need them for the second step of the chain. The result is that this step fails because many threads try to create the same HDF5 file.
This issue seems related to #87.
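For reference, a rough sketch of a setup with these characteristics (assuming Optimas's `TemplateEvaluator`, `ChainEvaluator`, and `Exploration` keyword arguments; the template files, varying parameter, objective, and analysis function are all placeholders, not taken from the actual script):

```python
# Minimal, self-contained sketch of a two-step chain like the one described
# above. Everything marked "placeholder" is invented for illustration.
from optimas.core import VaryingParameter, Objective
from optimas.generators import RandomSamplingGenerator
from optimas.evaluators import TemplateEvaluator, ChainEvaluator
from optimas.explorations import Exploration


def analyze_simulation(simulation_directory, output_params):
    # Placeholder: read the WarpX output and fill in the objective value.
    output_params["f"] = 0.0
    return output_params


var = VaryingParameter("x0", 0.0, 1.0)  # placeholder varying parameter
obj = Objective("f", minimize=True)     # placeholder objective
gen = RandomSamplingGenerator(varying_parameters=[var], objectives=[obj])

# Preparatory step: a single Python thread per worker, no GPUs needed;
# it writes the HDF5 input file for the main step.
ev_pre = TemplateEvaluator(
    sim_template="prepare_simulation.py",
)

# Main step: WarpX simulation using 32 GPUs (8 Perlmutter nodes) per worker.
ev_main = TemplateEvaluator(
    sim_template="warpx_template_script.py",  # placeholder template
    analysis_func=analyze_simulation,
    n_gpus=32,
)

# The chain reserves the maximum resources any step needs, which is where
# the over-allocation of the preparatory step comes from.
ev = ChainEvaluator([ev_pre, ev_main])

exp = Exploration(
    generator=gen,
    evaluator=ev,
    max_evals=16,    # placeholder
    sim_workers=16,  # 16 workers x 32 GPUs spread over 128 nodes
)
```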