hpc / pavilion2

Pavilion is a Python 3 (3.5+) based framework for running and analyzing tests targeting HPC systems.
https://pavilion2.readthedocs.io/

share_allocation: max -- asynchronous launching #725

Open j-ogas opened 7 months ago

j-ogas commented 7 months ago

Pav2 is the only test harness I've found that allows me to specify a number of nodes and execute all subsequent jobs on them (thank you). This is achieved as follows:

modes/share.yaml

scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max

However, when looking at the results output, it appears that these jobs are launched serially, rather than asynchronously. See below.

Edited pav results output showing launch times.

11:20:24
11:20:19
11:20:16
11:20:12
11:20:03
11:19:53
11:19:43
11:19:34
11:19:30
11:19:27
11:19:24
11:19:20

Note that all of these tests are single-rank, so they should be launchable asynchronously with srun using the following extra arguments.

  slurm:
    srun_extra:
      - --overlap

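A minimal sketch of what asynchronous launching inside an allocation could look like: each test becomes its own overlapping srun step, started without waiting for the previous one to finish. The launch_overlapping helper is hypothetical, not Pavilion API; the launcher prefix is a parameter so the control flow can be exercised without Slurm.

```python
import subprocess

def launch_overlapping(commands, launcher=("srun", "--overlap", "-N1", "-n1")):
    """Start every command at once as its own (overlapping) job step,
    then wait for all of them. Hypothetical helper, not Pavilion code."""
    procs = [subprocess.Popen([*launcher, *cmd]) for cmd in commands]
    return [p.wait() for p in procs]
```

With the default launcher this would run inside a Slurm allocation; passing an empty launcher runs the commands directly, which is handy for trying the logic out locally.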
One potential issue is overwhelming SLURM. Perhaps adding another key, e.g. max_queue, that limits the number of asynchronous jobs that can be put in the srun queue would be helpful. Perhaps something like the following.

modes/share.yaml

scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
  max_queue: 250
  slurm:
    srun_extra:
      - --overlap
      - --gres=craynetwork:0

Paul-Ferrell commented 7 months ago

Currently the kickoff scripts simply have a pav _run command for each test to run in an allocation, which is why this is synchronous.

What we need to do is expand pav _run so that it can take multiple tests as arguments, and then manage those tests according to their max_queue settings. It should look at the number of tasks each test requires via the scheduler variables and count that against the total queue size. Note that max_queue can vary from test to test (unless we make it one of the parameters that forces allocation separation), so the number of concurrently running tests will have to be managed dynamically. For example, with tests whose max_queue values are 1, 2, 4, and 12, the size-1 test would run by itself, while any pair of the size 2, 4, and 12 tests could run together.
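The grouping rule described above could be sketched as follows, treating each test as a (name, tasks, max_queue) tuple and using the smallest max_queue in a batch as that batch's effective limit. This is a hypothetical illustration of the rule, not the actual pav _run implementation.

```python
def group_tests(tests):
    """Greedily batch (name, tasks, max_queue) tuples so that the total
    task count in each batch never exceeds the smallest max_queue in
    that batch. Hypothetical sketch, not Pavilion code."""
    remaining = sorted(tests, key=lambda t: t[2])  # smallest limit first
    batches = []
    while remaining:
        first = remaining.pop(0)
        limit = first[2]   # the batch minimum, since the list is sorted
        used = first[1]
        batch = [first]
        for t in list(remaining):
            if used + t[1] <= limit:
                batch.append(t)
                remaining.remove(t)
                used += t[1]
        batches.append(batch)
    return batches
```

With single-task tests whose max_queue values are 1, 2, 4, and 12, this produces the behavior described: the limit-1 test runs alone, and a pair of the remaining tests can share a batch.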

I think we need a better name than max_queue. Maybe max_share_tasks?

j-ogas commented 7 months ago

One quick clarification: the hope is that max_queue caps the number of active jobs in the queue at any given time. So if I need 2000 tests to run on this single node, no more than max_queue of them would be queued at once until all 2000 complete.
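A rolling limit like that could be sketched as below: keep up to max_queue jobs in flight and start the next one whenever a slot frees up, until the whole list (e.g. all 2000 tests) has run. run_throttled and its launcher argument are hypothetical; a real version would prefix srun and reap children properly instead of polling in a sleep loop.

```python
import subprocess
import time

def run_throttled(commands, max_queue, launcher=()):
    """Keep at most max_queue jobs active at once; as each finishes,
    start the next, until every command has run.
    Hypothetical sketch: 'launcher' would be an srun prefix in practice;
    it defaults to empty so the logic can be tried without Slurm."""
    pending = list(commands)
    active = []
    results = []
    while pending or active:
        # Fill free slots up to the max_queue cap.
        while pending and len(active) < max_queue:
            active.append(subprocess.Popen([*launcher, *pending.pop(0)]))
        # Collect any jobs that have finished, freeing their slots.
        done = [p for p in active if p.poll() is not None]
        for p in done:
            active.remove(p)
            results.append(p.returncode)
        if not done:
            time.sleep(0.05)  # avoid busy-waiting between polls
    return results
```

The key property is that the number of live processes never exceeds max_queue, regardless of how long the pending list is.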

Paul-Ferrell commented 1 month ago

This is done.