reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Issue with determining number of valid nodes for num_tasks=0 #3216

Open lagerhardt opened 3 months ago

lagerhardt commented 3 months ago

With 4.6.1, if you have a reservation and a test with `num_tasks=0`, the framework returns 0 nodes. I'm invoking the code with `reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py`, and here's what I see if I turn up the logging:

[F] Flexible node allocation requested
[CMD] 'scontrol -a show -o nodes'
[F] Total available nodes: 443
[CMD] 'scontrol -a show res checkout'
[CMD] 'scontrol -a show -o Nodes=login[01-07],nid[001000-001023,001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001064-001065,001068-001069,001072-001073,001076-001077,001080-001081,001084-001085,001088-001089,001092-001093,200001-200257,200260-200261,200264-200265,200268-200269,200272-200273,200276-200277,200280-200281,200284-200285,200288-200289,200292-200293,200296-200297,200300-200301,200304-200305,200308-200309,200312-200313,200316-200317,200320-200321,200324-200325,200328-200329,200332-200333,200336-200337,200340-200341,200344-200345,200348-200349,200352-200353,200356-200357,200360-200361,200364-200365,200368-200369,200372-200373,200376-200377,200380-200381,200384-200385,200388-200389,200392-200393,200396-200397,200400-200401,200404-200405,200408-200409,200412-200413,200416-200417,200420-200421,200424-200425,200428-200429,200432-200433,200436-200437,200440-200441,200444-200445,200448-200449,200452-200453,200456-200457,200460-200461,200464-200465,200468-200469,200472-200473,200476-200477,200480-200481,200484-200485,200488-200489,200492-200493,200496-200497,200500-200501,200504-200505,200508-200509]'
[S] slurm: [F] Filtering nodes by reservation checkout: available nodes now: 0

There are available nodes in the reservation, though not all of them are. Here's the list of states:

1 State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+DRAIN+MAINTENANCE+RESERVED
5 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING
1 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED
1 State=DOWN+RESERVED
7 State=IDLE+DRAIN+MAINTENANCE+RESERVED
2 State=IDLE+MAINTENANCE+RESERVED
45 State=IDLE+RESERVED
1 State=IDLE+RESERVED
370 State=IDLE+RESERVED
1 State=IDLE+RESERVED
1 State=IDLE
1 State=MIXED+RESERVED

I can only get a non-zero number if I add `--flex-alloc-nodes=IDLE+RESERVED`; I still get zero with `--flex-alloc-nodes=IDLE`. It was my understanding that asking for IDLE was supposed to match any of these states, but that doesn't seem to be the case. I suspect that the `&` (set intersection) between two node sets at https://github.com/reframe-hpc/reframe/blob/392efbc8d9ae96754fb4af87027a77492c06f9f8/reframe/core/schedulers/slurm.py#L345 has something to do with it: from my logging, it looks like the node set is already empty before it queries the nodes in the reservation.
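
To illustrate what I suspect is happening (a rough Python sketch, not the actual ReFrame code; node names are made up): if the node set has already been emptied by the earlier state filter, the intersection with the reservation's nodes can only be empty.

```python
# Rough sketch of the suspected order of operations (not ReFrame's actual code).
state_filtered = set()  # nodes surviving the earlier state filter (empty in my runs)
reservation_nodes = {'nid001004', 'nid001005', 'nid001023'}  # made-up names

# Intersecting an empty set with the reservation's nodes always yields 0 nodes,
# no matter how many nodes the reservation actually contains.
available = state_filtered & reservation_nodes
print(len(available))  # 0
```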

vkarak commented 3 months ago

Hi @lagerhardt, this has changed in 4.6 (check the docs for the `--distribute` option). To revert to the original behaviour, you should use `--flex-alloc-policy=idle*`. The rationale behind this change is that there was previously no way to select nodes that are exclusively in a given state, so ReFrame could end up requesting nodes in IDLE+DRAIN states, which it could never get.
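
Roughly, the new matching semantics work like this (an illustrative sketch, not the actual implementation):

```python
# Illustrative sketch of the 4.6 state-matching semantics (not the actual code).
def matches(node_state, policy):
    '''node_state is Slurm's State= field, e.g. 'IDLE+RESERVED';
    policy is the --flex-alloc-policy value, e.g. 'idle' or 'idle*'.'''
    flags = set(node_state.upper().split('+'))
    if policy.endswith('*'):
        # 'idle*': the node only needs to include the requested flags
        return set(policy[:-1].upper().split('+')) <= flags

    # 'idle': the node must be in exactly this state
    return flags == set(policy.upper().split('+'))


print(matches('IDLE+RESERVED', 'idle'))    # False: exact match required
print(matches('IDLE+RESERVED', 'idle*'))   # True: the IDLE flag is present
print(matches('IDLE+DRAIN', 'idle'))       # False now; the old behaviour could pick it up
```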

lagerhardt commented 3 months ago

Ah, thanks. I missed that update.

So it looks like I can get close to the behavior I want with the `--flex-alloc-nodes=idle+reserved` flag, but I see the note that multiple tests have to be executed serially. This would be okay, except we have separate tests for our GPU and CPU partitions and we don't want one partition to sit idle while the other is busy. In theory I could get around this by running two separate instances of ReFrame, each targeting a single partition, but that adds a second layer of complexity. Before I start setting that up, I was wondering if you knew of a better way to run tests across all available nodes. If it's via this mechanism, are there any plans to adjust this behavior in the near future?

vkarak commented 3 months ago

This note was valid even before. You can still run in parallel across ReFrame partitions. The reason behind this note is that the first test will consume all available nodes in the partition, so the next one will not find any idle nodes and will be skipped. Across partitions this is not a problem, as ReFrame scopes the node request automatically.

We should update this note in the docs to make this clearer.
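
For example, if each full-system test is scoped to its own partition via `valid_systems`, the tests never compete for the same nodes (a sketch with made-up system and partition names):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class FullSystemCPUCheck(rfm.RunOnlyRegressionTest):
    '''Hypothetical full-system test scoped to the CPU partition only.'''
    valid_systems = ['muller:cpu']     # made-up partition name
    valid_prog_environs = ['builtin']
    num_tasks = 0                      # flexible allocation within this partition only
    executable = 'hostname'

    @sanity_function
    def validate(self):
        return sn.assert_found(r'\S+', self.stdout)


@rfm.simple_test
class FullSystemGPUCheck(FullSystemCPUCheck):
    '''Same test scoped to the GPU partition; it never competes with the CPU one.'''
    valid_systems = ['muller:gpu_ss11']
```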

lagerhardt commented 3 months ago

The issue is that I have multiple GPU and CPU full-system tests I want to run. If I do that with the async execution option, then the second test of whatever type fails because there are no available nodes. If I run with the serial option, it runs just a single test at a time, so the CPU partition sits idle while the GPU one runs and vice versa.

In an ideal case, I'd like to be able to submit all these tests (as well as a bunch of smaller tests) at once from a single instance of ReFrame, but I'm having trouble thinking of how to do that.


teojgo commented 3 months ago

One solution that might work is to run with the async execution policy but limit the number of jobs per partition to 1. Each test will then try to consume all the nodes of its partition, but another job will not be submitted until the nodes are free again.
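
Something along these lines in the settings file should do it (a sketch with made-up names; `max_jobs` caps the number of active tests per partition under the async policy):

```python
# Excerpt of a hypothetical ReFrame settings file.
site_configuration = {
    'systems': [
        {
            'name': 'muller',            # made-up system name
            'descr': 'Example system',
            'hostnames': ['login'],
            'partitions': [
                {
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 1,       # at most one active test in this partition
                },
                {
                    'name': 'gpu_ss11',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 1,
                },
            ],
        },
    ],
    'environments': [
        {'name': 'builtin', 'cc': 'cc', 'cxx': 'CC', 'ftn': 'ftn'},
    ],
}
```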

vkarak commented 3 months ago

@lagerhardt We have now added a new pseudo-state in the flexible allocation policy. You can run with `--flex-alloc-policy=avail` or `--distribute=avail` (depending on which option you use) and this will scale the tests to all the "available" nodes in each partition. An available node is one in the ALLOCATED, COMPLETING or IDLE state. This way you can still submit all of your tests with the async policy: those submitted later will just wait.
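
Conceptually, the new pseudo-state behaves roughly like this (an illustrative sketch, not the actual code):

```python
# Sketch of how the 'avail' pseudo-state classifies nodes (illustrative only).
AVAIL_FLAGS = {'ALLOCATED', 'COMPLETING', 'IDLE'}

def is_avail(node_state):
    '''node_state is Slurm's State= field, e.g. 'IDLE' or 'ALLOCATED'.'''
    return set(node_state.upper().split('+')) <= AVAIL_FLAGS


print(is_avail('IDLE'))        # True
print(is_avail('ALLOCATED'))   # True: the test's job simply waits for it
print(is_avail('IDLE+DRAIN'))  # False
```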

lagerhardt commented 3 months ago

That sounds great! Thank you!!! When would this be available?

vkarak commented 3 months ago

> That sounds great! Thank you!!! When would this be available?

It has been available since 4.6 :-)

lagerhardt commented 2 months ago

Sorry for the long silence, finally able to come back to this. I am still getting zero nodes: `--flex-alloc-nodes=avail` gives me zero nodes for all jobs, even the first one (by the way, I'm only seeing `--flex-alloc-nodes`, not `--flex-alloc-policy`, which is what I assume you meant). For a reservation with five nodes, where only two of the nodes are up and of the proper type, this is what I get with the avail flag:


```
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o nodes'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes: 443
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filter by state: 45
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show res checkout'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o Nodes=nid[001004-001006,001023,001040-001041]'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o partitions'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filternodes: 0
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: caught reframe.core.exceptions.JobError: [jobid=None] could not satisfy the minimum task requirement: required 4, found 0
```

vkarak commented 2 months ago

Yes, I meant `--flex-alloc-policy` (I always mix up the name). What state are the reservation nodes in? If they are also in the RESERVED state, that could explain it, since avail excludes nodes in this state. Maybe, if a reservation is also requested, it would make sense for ReFrame to automatically allow nodes in the RESERVED state to be selected.
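
Something like this is what I have in mind (a sketch of the proposed behaviour, not an existing feature):

```python
# Proposed tweak (sketch only): when the run requests a reservation, also accept
# the RESERVED flag when deciding whether a node counts as 'available'.
AVAIL_FLAGS = {'ALLOCATED', 'COMPLETING', 'IDLE'}

def is_avail(node_state, with_reservation=False):
    allowed = AVAIL_FLAGS | ({'RESERVED'} if with_reservation else set())
    return set(node_state.upper().split('+')) <= allowed


print(is_avail('IDLE+RESERVED'))                         # False: current behaviour
print(is_avail('IDLE+RESERVED', with_reservation=True))  # True: proposed behaviour
```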

lagerhardt commented 2 months ago

Yes, these are also reserved. We typically use a full system reservation after a maintenance to do checkout.

vkarak commented 2 months ago

I think we need to add better support for RESERVED nodes then. Currently, you could run with `--flex-alloc-policy=IDLE+RESERVED`, but you would have to submit the tests serially as before.