Closed: @lagerhardt closed this issue 3 weeks ago.
Hi @lagerhardt, this has changed in 4.6 (check the docs for the `--distribute` option). To revert to the original behaviour you should use `--flex-alloc-policy=idle*`. The rationale behind this change is that previously there was no way to select nodes exclusively in a state, so ReFrame could end up requesting nodes in IDLE+DRAIN states, which it could never get.
Ah, thanks. I missed that update.
So it looks like I can roughly get the behavior I want with the `--flex-alloc-nodes=idle+reserved` flag, but I see the note that multiple tests then have to be executed serially. This would be okay, except we have multiple separate tests for our GPU and CPU partitions, and we don't want one to sit idle while the other is busy. In theory I could get around this by running two separate instances of ReFrame, each targeting a single partition, but that adds a second layer of complexity. Before I start setting that up, I was wondering if you knew of a better way to run tests across all available nodes. If it's via this mechanism, are there any plans to adjust this behavior in the near future?
This note was valid even before. You can still run in parallel across ReFrame partitions. The reason behind this note is that the first test will consume all available nodes in the partition, so the next one will not find any idle nodes and will be skipped. Across partitions, though, this is not a problem, as ReFrame scopes the node request automatically.
We should update this note in the docs to make this clearer.
The issue is that I have multiple GPU and CPU full-system tests I want to run. If I do that with the async execution option, the second test of whatever type fails because there are no available nodes. If I run with the serial option, it runs just a single test at a time, so the CPU partition sits idle while the GPU one runs and vice versa.

In an ideal case, I'd like to be able to submit all these tests (as well as a bunch of smaller tests) at once from a single instance of ReFrame, but I'm having trouble thinking of how to do that.
One solution that might work is to run with the async execution policy but limit the number of jobs per partition to 1. Each test will then try to consume all the nodes of its partition, but ReFrame will not submit another job until the nodes are free again.
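If done through the configuration, this could look roughly like the sketch below, which limits in-flight jobs per partition via the `max_jobs` partition setting. The system and partition names are placeholders taken from the log output later in this thread, not a real site configuration:

```python
# Hypothetical excerpt of a ReFrame settings file: capping in-flight jobs
# per partition so each flexible test waits for nodes to free up.
site_configuration = {
    'systems': [
        {
            'name': 'muller',            # placeholder system name
            'hostnames': ['muller'],
            'partitions': [
                {
                    'name': 'gpu_ss11',  # placeholder partition name
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['default'],
                    'max_jobs': 1        # at most one job in flight here
                },
            ]
        }
    ]
}
```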
@lagerhardt We have now added a new pseudo-state in the flexible allocation policy. You can run with `--flex-alloc-policy=avail` or `--distribute=avail` (depending on which option you use) and this will scale the tests to all the "available" nodes in each partition. An available node is one that is in the ALLOCATED, COMPLETING or IDLE state. This way you can still submit all of your tests with the async policy: those submitted later will just wait.
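As I understand the description, the `avail` pseudo-state classifies a node by exact state match against those three states; a tiny sketch of that classification (not ReFrame's actual code, and the exact-match semantics is my reading of the behaviour discussed later in this thread):

```python
AVAIL_STATES = {'ALLOCATED', 'COMPLETING', 'IDLE'}

def is_avail(node_state: str) -> bool:
    # A Slurm node reports a state string like 'IDLE' or 'IDLE+RESERVED';
    # here the state must be exactly one of the "avail" states to match.
    return node_state.upper() in AVAIL_STATES

print(is_avail('idle'))           # True
print(is_avail('IDLE+RESERVED'))  # False: reserved nodes are excluded
```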
That sounds great! Thank you!!! When would this be available?
It is already available, since 4.6 :-)
Sorry for the long silence, finally able to come back to this. I am still getting zero nodes with `--flex-alloc-nodes=avail`, even for the first job (btw, I'm only seeing `--flex-alloc-nodes`, not `--flex-alloc-policy`, which is what I assume you meant). For a reservation with five nodes where only two of the nodes are up and of the proper type, the avail flag gives me:
```
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o nodes'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes: 443
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filter by state: 45
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show res checkout'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o Nodes=nid[001004-001006,001023,001040-001041]'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o partitions'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filternodes: 0
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: caught reframe.core.exceptions.JobError: [jobid=None] could not satisfy the minimum task requirement: required 4, found 0
```
Yes, I meant `--flex-alloc-policy` (I always mix up the name). What state are the reservation nodes in? If they are also in the RESERVED state, that could explain it, since `avail` excludes nodes in this state. Maybe, if a reservation is also requested, it would make sense for ReFrame to allow nodes in the RESERVED state to be selected automatically.
Yes, these are also reserved. We typically use a full system reservation after a maintenance to do checkout.
I think we need to add better support for RESERVED nodes then. Currently you could run with `--flex-alloc-policy=IDLE+RESERVED`, but you would have to submit the tests serially as before.
@vkarak , do you have any suggestions where a good place to add better support for this might be?
One way might be to extend the syntax of `--flex-alloc-nodes` to handle something like an "ignore these states" suffix, e.g. `--flex-alloc-nodes=avail-reserved` (maybe there's a better notation?). This seems flexible, but I haven't really thought about how it fits in more broadly.
For example, changes in `schedulers.filter_nodes_by_state`:

```diff
+        if '-' in state:
+            state, ignore_states = state.split('-')
+        else:
+            ignore_states = ''
         if state == 'avail':
-            nodelist = {n for n in nodelist if n.is_avail()}
+            nodelist = {n for n in nodelist if n.is_avail(ignore_states)}
```
and `schedulers.slurm`:

```diff
-    def in_statex(self, state):
-        return self._states == set(state.upper().split('+'))
+    def in_statex(self, state, ignore_states=''):
+        return (self._states - set(ignore_states.upper().split('+')) ==
+                set(state.upper().split('+')))

-    def is_avail(self):
-        return any(self.in_statex(s)
+    def is_avail(self, ignore_states=''):
+        return any(self.in_statex(s, ignore_states=ignore_states)
                    for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))
```
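Standalone, the proposed extension would behave roughly like this. The `Node` class below is a minimal stand-in for a Slurm node record, not ReFrame's actual class:

```python
class Node:
    """Minimal stand-in for a Slurm node record (hypothetical)."""

    def __init__(self, states):
        self._states = set(states)

    def in_statex(self, state, ignore_states=''):
        # Drop the ignored states before comparing against the target state
        ignored = set(ignore_states.upper().split('+')) - {''}
        return self._states - ignored == set(state.upper().split('+'))

    def is_avail(self, ignore_states=''):
        return any(self.in_statex(s, ignore_states=ignore_states)
                   for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

# 'avail-reserved' would be split into state='avail', ignore_states='reserved'
state, _, ignore_states = 'avail-reserved'.partition('-')

node = Node({'IDLE', 'RESERVED'})
print(node.is_avail())               # False: RESERVED blocks the match
print(node.is_avail(ignore_states))  # True: RESERVED is ignored
```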
Alternatively, it might be reasonable to ignore the RESERVED state generally in `is_avail()`. In `SlurmJobScheduler.filternodes`, at least, there's already a step that filters nodes based on the presence of a reservation option before the node-state filter is applied. I'm not sure what else might be impacted by that, though.
Another issue with the existing `is_avail` implementation contributing to the broader issue is that nodes can be both IDLE and COMPLETING at the same time. I see some nodes in an active reservation with this state: `State=IDLE+COMPLETING+RESERVED`
I think this is easily fixed if `is_avail` allows the node to be in any combination of the "avail" states, i.e., `self._states <= {'ALLOCATED', 'COMPLETING', 'IDLE'}`
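A quick sketch of the difference between the equality check and the proposed subset check (hypothetical helper functions, not ReFrame's code):

```python
AVAIL = {'ALLOCATED', 'COMPLETING', 'IDLE'}

def is_avail_eq(states):
    # Current behaviour: the node must be in exactly one avail state
    return any(states == {s} for s in AVAIL)

def is_avail_subset(states):
    # Proposed: any combination of avail states counts as available
    return states <= AVAIL

combo = {'IDLE', 'COMPLETING'}
print(is_avail_eq(combo))      # False: the combined state fails equality
print(is_avail_subset(combo))  # True: both components are avail states
```

Note that a state like `IDLE+COMPLETING+RESERVED` would still be excluded by the subset check, since RESERVED is not an avail state; that part is the separate RESERVED discussion above.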
For including the RESERVED nodes, I don't know yet what would be the best implementation. Ideally, we would want RESERVED to be automatically included in the "avail" states if `--reservation` is passed, but looking at the way the node filtering is currently implemented, it's not so straightforward without breaking the encapsulation.
I have an idea for this. Since we do a scheduler-specific filtering anyway here, the `is_avail()` for Slurm could always include the RESERVED nodes and then filter them out in `filternodes()` if the `--reservation` option is not passed.
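A rough sketch of that idea; the function names mirror the ones discussed above, but the data model and signatures are illustrative, not ReFrame's internals:

```python
# RESERVED is tentatively treated as available at the state-filter stage
AVAIL = {'ALLOCATED', 'COMPLETING', 'IDLE', 'RESERVED'}

def is_avail(states):
    return states <= AVAIL

def filternodes(nodes, reservation=None):
    # Drop RESERVED nodes again unless a reservation was requested
    selected = [n for n in nodes if is_avail(n['states'])]
    if reservation is None:
        selected = [n for n in selected if 'RESERVED' not in n['states']]
    return selected

nodes = [{'name': 'nid001004', 'states': {'IDLE', 'RESERVED'}},
         {'name': 'nid001005', 'states': {'IDLE'}}]
print([n['name'] for n in filternodes(nodes)])              # ['nid001005']
print([n['name'] for n in filternodes(nodes, 'checkout')])  # both nodes
```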
With 4.6.1, if you have a reservation and a test with `num_tasks=0`, the framework returns 0 nodes. I'm invoking the code with

```
reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py
```

and here's what I see if I turn up the logging: There are available nodes in the reservation, though not all of them are available. Here's the list of states:
I can only get a non-zero number if I add `--flex-alloc-nodes=IDLE+RESERVED`. I still get zero if I add `--flex-alloc-nodes=IDLE`. It was my understanding that asking for IDLE was supposed to match any of these states, but that doesn't seem to be the case. I suspect that the fact that it's doing an `&` between two node sets (at https://github.com/reframe-hpc/reframe/blob/392efbc8d9ae96754fb4af87027a77492c06f9f8/reframe/core/schedulers/slurm.py#L345) might have something to do with it. From my logging it looks like the node set is empty before it queries the nodes in the reservation.
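The suspected failure mode can be reproduced with plain sets: if the state filter has already emptied the candidate set, intersecting it with the reservation's nodes can only ever yield the empty set. (A sketch of the set logic only, not the actual `slurm.py` code; node names are from the log above.)

```python
candidates = set()                        # already emptied by the state filter
reservation = {'nid001004', 'nid001005'}  # nodes in the reservation

# An intersection can never re-introduce nodes that were filtered out earlier
print(candidates & reservation)  # set()
```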