optimas-org / optimas

Optimization at scale, powered by libEnsemble
https://optimas.readthedocs.io

ChainEvaluator resource allocation #242

Closed n01r closed 2 months ago

n01r commented 2 months ago

Hi, I faced an issue with the ChainEvaluator and its resource allocation.

I have two TemplateEvaluators in my chain. The first, ev_pre, is a preparatory step that is supposed to create an HDF5 input file for the main step. The main step, ev_main, then uses 32 GPUs to run a WarpX simulation.

I allocated 128 nodes on Perlmutter to get 16 workers with 32 GPUs (i.e., 8 nodes) per worker. As specified in the TemplateEvaluator documentation, when you don't supply n_gpus or n_procs for a task, each defaults to 1. A ChainEvaluator then allocates the maximum resources needed by any task in the chain.
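For reference, this is roughly how the evaluators are set up (a simplified sketch, not my actual script; the template paths are placeholders, and the ChainEvaluator call assumes it takes the list of evaluators as its first argument):

from optimas.evaluators import TemplateEvaluator, ChainEvaluator

# Preparatory step: each worker should run a single Python process that
# writes the HDF5 input file. n_procs and n_gpus are not given, so per the
# TemplateEvaluator docs they default to 1.
ev_pre = TemplateEvaluator(
    sim_template="prepare_simulation.py",  # placeholder path
)

# Main step: WarpX simulation on 32 GPUs (8 Perlmutter GPU nodes) per worker.
ev_main = TemplateEvaluator(
    sim_template="warpx_template.py",  # placeholder path
    n_gpus=32,
)

# The ChainEvaluator reserves the maximum resources needed by any task in
# the chain, so every step runs inside the 32-GPU allocation of ev_main.
ev = ChainEvaluator([ev_pre, ev_main])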

Now, the problem is that my first task really just requires each of the 16 workers to run a single Python process. However, that task got launched with srun -w <list of 32 nodes> --ntasks 64 --nodes 64 --ntasks-per-node 1 --gpus-per-task 1 --exact /global/cfs/cdirs/m4546/mgarten/sw/perlmutter/gpu/venvs/warpx-gpu/bin/python prepare_simulation.py.

This is accompanied by

[1]  2024-08-07 19:55:18,958 libensemble.executors.mpi_runner (INFO): Adjusted ngpus to split evenly across nodes. From 32 to 64
[2]  2024-08-07 19:55:18,958 libensemble.executors.mpi_runner (INFO): Adjusted ngpus to split evenly across nodes. From 32 to 64
[1]  2024-08-07 19:55:18,958 libensemble.executors.mpi_runner (INFO): Adjusted nprocs to split evenly across nodes. From 1 to 64
[2]  2024-08-07 19:55:18,958 libensemble.executors.mpi_runner (INFO): Adjusted nprocs to split evenly across nodes. From 1 to 64

Note that ntasks-per-node and gpus-per-task are equal to 1, as I require. But I do not need 64 tasks or 64 nodes: I need 16 tasks, one per sim_worker. For the first (shorter) step I do not need all of the resources, but I do need them for the second step of the chain.

The result is that this step fails because many processes try to create the same HDF5 file.

This issue seems related to #87.

n01r commented 2 months ago

It is interesting that I did not see this in a debug test case that I ran earlier. There, I had reduced the required resources to only 4 nodes: ev_pre still needs just one process per worker, and ev_main needs 4 GPUs per worker. In that case, the first launch command reads: srun -w <specific node name> --ntasks 1 --nodes 1 --ntasks-per-node 1 --gpus-per-task 4 --exact /global/cfs/cdirs/m4546/mgarten/sw/perlmutter/gpu/venvs/warpx-gpu/bin/python prepare_simulation.py

It is a bit unclear to me, though, why gpus-per-task is suddenly 4 even though the evaluators are set up the same way. But in this test run I did not encounter the issue reported above.

shuds13 commented 2 months ago

@n01r

It looks to me like the sim_function in Optimas is not set up to handle this case properly when there is more than one node (no num_nodes is passed to executor.submit, so it will use all available nodes). It might work to set n_procs=1 for that evaluator.
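Something along these lines (untested sketch; the template path is a placeholder):

from optimas.evaluators import TemplateEvaluator

# Untested sketch: pin the preparatory evaluator to a single process so its
# srun line does not expand across all nodes assigned to the worker.
ev_pre = TemplateEvaluator(
    sim_template="prepare_simulation.py",  # placeholder path
    n_procs=1,
)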

That said, I don't know why your run line is saying 64 nodes.

shuds13 commented 2 months ago

On second thought, it might not need num_nodes added; if nprocs=1 is passed to libEnsemble, it should work.

If not, can you send me your script? I'll try to reproduce.

n01r commented 2 months ago

Thanks, @shuds13, I'll report back once I've tried that. :)

shuds13 commented 2 months ago

@n01r I found a bug in libEnsemble, where all worker-assigned GPUs get used when num_gpus=0. I've fixed that in this branch https://github.com/Libensemble/libensemble/pull/1398.
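For context, the preparatory task boils down to a submit call roughly like the following inside the simulation function (a simplified sketch, not the actual Optimas code; the registered app path is a placeholder). With the bug, num_gpus=0 fell back to all GPUs assigned to the worker; with the fix on that branch, 0 is respected.

from libensemble.executors import MPIExecutor

exctr = MPIExecutor()
exctr.register_app(
    full_path="/path/to/python",  # placeholder; the real app is the Python interpreter
    app_name="prepare",
)

# Simplified sketch of the preparatory submission: one process, no GPUs.
task = exctr.submit(
    app_name="prepare",
    app_args="prepare_simulation.py",
    num_procs=1,
    num_gpus=0,
)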

I'll check the situation on multiple nodes.

n01r commented 2 months ago

Oh awesome, thank you! I put in a job with n_procs=1, but I'm attending an event today, so I couldn't do more focused tests yet.

shuds13 commented 2 months ago

It looks like it's still not right over multiple nodes, so I will look into that.

shuds13 commented 2 months ago

I think I've fixed the multi-node case on that branch.

n01r commented 2 months ago

The srun commands look correct, and it now seems fixed to me. Thanks a lot, @shuds13! :) I used a smaller case with the same characteristics (the second evaluator is multi-node, the first evaluator is single-process and requires no GPU) and ran it on the Perlmutter debug queue.

n01r commented 2 months ago

Solved by https://github.com/Libensemble/libensemble/pull/1398.