reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Issue with determining number of valid nodes for num_tasks=0 #3216

Open lagerhardt opened 3 months ago

lagerhardt commented 3 months ago

With 4.6.1, if you have a reservation and a test with `num_tasks=0`, the framework returns 0 nodes. I'm invoking the code with `reframe -vvvvvvv -r -R -c checks/microbenchmarks/dgemm -J reservation=checkout -n dgemm_cpu -C nersc-config.py`, and here's what I see if I turn up the logging:

[F] Flexible node allocation requested
[CMD] 'scontrol -a show -o nodes'
[F] Total available nodes: 443
[CMD] 'scontrol -a show res checkout'
[CMD] 'scontrol -a show -o Nodes=login[01-07],nid[001000-001023,001033,001036-001037,001040-001041,001044-001045,001048-001049,001052-001053,001064-001065,001068-001069,001072-001073,001076-001077,001080-001081,001084-001085,001088-001089,001092-001093,200001-200257,200260-200261,200264-200265,200268-200269,200272-200273,200276-200277,200280-200281,200284-200285,200288-200289,200292-200293,200296-200297,200300-200301,200304-200305,200308-200309,200312-200313,200316-200317,200320-200321,200324-200325,200328-200329,200332-200333,200336-200337,200340-200341,200344-200345,200348-200349,200352-200353,200356-200357,200360-200361,200364-200365,200368-200369,200372-200373,200376-200377,200380-200381,200384-200385,200388-200389,200392-200393,200396-200397,200400-200401,200404-200405,200408-200409,200412-200413,200416-200417,200420-200421,200424-200425,200428-200429,200432-200433,200436-200437,200440-200441,200444-200445,200448-200449,200452-200453,200456-200457,200460-200461,200464-200465,200468-200469,200472-200473,200476-200477,200480-200481,200484-200485,200488-200489,200492-200493,200496-200497,200500-200501,200504-200505,200508-200509]'
[S] slurm: [F] Filtering nodes by reservation checkout: available nodes now: 0

There are available nodes in the reservation, though not all of them are. Here's the list of states:

1 State=DOWN+DRAIN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+DRAIN+MAINTENANCE+RESERVED
5 State=DOWN+DRAIN+RESERVED+NOT_RESPONDING
1 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
2 State=DOWN+MAINTENANCE+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED+NOT_RESPONDING
1 State=DOWN+RESERVED
1 State=DOWN+RESERVED
7 State=IDLE+DRAIN+MAINTENANCE+RESERVED
2 State=IDLE+MAINTENANCE+RESERVED
45 State=IDLE+RESERVED
1 State=IDLE+RESERVED
370 State=IDLE+RESERVED
1 State=IDLE+RESERVED
1 State=IDLE
1 State=MIXED+RESERVED

I can only get a non-zero number if I add `--flex-alloc-nodes=IDLE+RESERVED`; I still get zero with `--flex-alloc-nodes=IDLE`. It was my understanding that asking for IDLE was supposed to match any of these states, but that doesn't seem to be the case. I suspect that the `&` (set intersection) between two node sets at https://github.com/reframe-hpc/reframe/blob/392efbc8d9ae96754fb4af87027a77492c06f9f8/reframe/core/schedulers/slurm.py#L345 has something to do with it: from my logging, it looks like the node set is already empty before it queries the nodes in the reservation.
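
To illustrate what I suspect is happening (a rough Python sketch, not the actual ReFrame code; node names are made up): if the node set has already been emptied by the earlier state filter, the intersection with the reservation's nodes can only be empty.

```python
# Rough sketch of the suspected order of operations (not ReFrame's actual code).
state_filtered = set()  # nodes surviving the earlier state filter (empty in my runs)
reservation_nodes = {'nid001004', 'nid001005', 'nid001023'}  # made-up names

# Intersecting an empty set with the reservation's nodes always yields 0 nodes,
# no matter how many nodes the reservation actually contains.
available = state_filtered & reservation_nodes
print(len(available))  # 0
```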

vkarak commented 3 months ago

Hi @lagerhardt, this has changed in 4.6 (check the docs for the `--distribute` option). To revert to the original behaviour, you should use `--flex-alloc-policy=idle*`. The rationale behind this change is that there was previously no way to select nodes that are exclusively in a given state, so ReFrame could end up requesting nodes in IDLE+DRAIN states, which it could never get.
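
Roughly, the new matching semantics work like this (an illustrative sketch, not the actual implementation):

```python
# Illustrative sketch of the 4.6 state-matching semantics (not the actual code).
def matches(node_state, policy):
    '''node_state is Slurm's State= field, e.g. 'IDLE+RESERVED';
    policy is the --flex-alloc-policy value, e.g. 'idle' or 'idle*'.'''
    flags = set(node_state.upper().split('+'))
    if policy.endswith('*'):
        # 'idle*': the node only needs to include the requested flags
        return set(policy[:-1].upper().split('+')) <= flags

    # 'idle': the node must be in exactly this state
    return flags == set(policy.upper().split('+'))


print(matches('IDLE+RESERVED', 'idle'))    # False: exact match required
print(matches('IDLE+RESERVED', 'idle*'))   # True: the IDLE flag is present
print(matches('IDLE+DRAIN', 'idle'))       # False now; the old behaviour could pick it up
```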

lagerhardt commented 3 months ago

Ah, thanks. I missed that update.

So it looks like I can get close to the behavior I want with the `--flex-alloc-nodes=idle+reserved` flag, but I see the note that multiple tests have to be executed serially. This would be okay, except we have separate tests for our GPU and CPU partitions and we don't want one partition to sit idle while the other is busy. In theory I could get around this by running two separate instances of ReFrame, each targeting a single partition, but that adds a second layer of complexity. Before I start setting that up, I was wondering if you knew of a better way to run tests across all available nodes. If it's via this mechanism, are there any plans to adjust this behavior in the near future?

vkarak commented 3 months ago

This note was valid even before. You can still run in parallel across ReFrame partitions. The reason behind this note is that the first test will consume all available nodes in the partition, so the next one will not find any idle nodes and will be skipped. Across partitions this is not a problem, as ReFrame scopes the node request automatically.

We should update this note in the docs to make this clearer.
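
For example, if each full-system test is scoped to its own partition via `valid_systems`, the tests never compete for the same nodes (a sketch with made-up system and partition names):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class FullSystemCPUCheck(rfm.RunOnlyRegressionTest):
    '''Hypothetical full-system test scoped to the CPU partition only.'''
    valid_systems = ['muller:cpu']     # made-up partition name
    valid_prog_environs = ['builtin']
    num_tasks = 0                      # flexible allocation within this partition only
    executable = 'hostname'

    @sanity_function
    def validate(self):
        return sn.assert_found(r'\S+', self.stdout)


@rfm.simple_test
class FullSystemGPUCheck(FullSystemCPUCheck):
    '''Same test scoped to the GPU partition; it never competes with the CPU one.'''
    valid_systems = ['muller:gpu_ss11']
```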

lagerhardt commented 3 months ago

The issue is that I have multiple GPU and CPU full-system tests I want to run. If I do that with the async execution option, then the second test of whatever type fails because there are no available nodes. If I run with the serial option, it runs just a single test at a time, so the CPU partition sits idle while the GPU one runs and vice versa.

In an ideal case, I'd like to be able to submit all these tests (as well as a bunch of smaller tests) at once from a single instance of ReFrame, but I'm having trouble thinking of how to do that.


teojgo commented 3 months ago

One solution that might work is to run with the async execution policy but limit the number of jobs per partition to 1. Each test will then try to consume all the nodes of its partition, but another job will not be submitted until the nodes are free again.
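
Something along these lines in the settings file should do it (a sketch with made-up names; `max_jobs` caps the number of active tests per partition under the async policy):

```python
# Excerpt of a hypothetical ReFrame settings file.
site_configuration = {
    'systems': [
        {
            'name': 'muller',            # made-up system name
            'descr': 'Example system',
            'hostnames': ['login'],
            'partitions': [
                {
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 1,       # at most one active test in this partition
                },
                {
                    'name': 'gpu_ss11',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 1,
                },
            ],
        },
    ],
    'environments': [
        {'name': 'builtin', 'cc': 'cc', 'cxx': 'CC', 'ftn': 'ftn'},
    ],
}
```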

vkarak commented 3 months ago

@lagerhardt We have now added a new pseudo-state in the flexible allocation policy. You can run with `--flex-alloc-policy=avail` or `--distribute=avail` (depending on which option you use) and this will scale the tests to all the "available" nodes in each partition. An available node is one in the ALLOCATED, COMPLETING or IDLE state. This way you can still submit all of your tests with the async policy: those submitted later will just wait.
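
Conceptually, the new pseudo-state behaves roughly like this (an illustrative sketch, not the actual code):

```python
# Sketch of how the 'avail' pseudo-state classifies nodes (illustrative only).
AVAIL_FLAGS = {'ALLOCATED', 'COMPLETING', 'IDLE'}

def is_avail(node_state):
    '''node_state is Slurm's State= field, e.g. 'IDLE' or 'ALLOCATED'.'''
    return set(node_state.upper().split('+')) <= AVAIL_FLAGS


print(is_avail('IDLE'))        # True
print(is_avail('ALLOCATED'))   # True: the test's job simply waits for it
print(is_avail('IDLE+DRAIN'))  # False
```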

lagerhardt commented 3 months ago

That sounds great! Thank you!!! When would this be available?

vkarak commented 3 months ago

> That sounds great! Thank you!!! When would this be available?

It has been available since 4.6 :-)

lagerhardt commented 2 months ago

Sorry for the long silence, finally able to come back to this. I am still getting zero nodes: `--flex-alloc-nodes=avail` gives me zero nodes for all jobs, even the first one (by the way, I'm only seeing `--flex-alloc-nodes`, not `--flex-alloc-policy`, which is what I assume you meant). For a reservation with five nodes, where only two of the nodes are up and of the proper type, this is what I get with the avail flag:


```
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o nodes'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes: 443
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filter by state: 45
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show res checkout'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o Nodes=nid[001004-001006,001023,001040-001041]'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [CMD] 'scontrol -a show -o partitions'
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: [F] Total available nodes after filternodes: 0
[2024-07-15T15:16:12] debug: dgemm_gpu /3d2d2c0e @muller:gpu_ss11+default: caught reframe.core.exceptions.JobError: [jobid=None] could not satisfy the minimum task requirement: required 4, found 0
```

vkarak commented 2 months ago

Yes, I meant `--flex-alloc-policy` (I always mix up the name). What state are the reservation nodes in? If they are also in the RESERVED state, that could explain it, since avail excludes nodes in this state. Maybe, if a reservation is also requested, it would make sense for ReFrame to automatically allow nodes in the RESERVED state to be selected.
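
Something like this is what I have in mind (a sketch of the proposed behaviour, not an existing feature):

```python
# Proposed tweak (sketch only): when the run requests a reservation, also accept
# the RESERVED flag when deciding whether a node counts as 'available'.
AVAIL_FLAGS = {'ALLOCATED', 'COMPLETING', 'IDLE'}

def is_avail(node_state, with_reservation=False):
    allowed = AVAIL_FLAGS | ({'RESERVED'} if with_reservation else set())
    return set(node_state.upper().split('+')) <= allowed


print(is_avail('IDLE+RESERVED'))                         # False: current behaviour
print(is_avail('IDLE+RESERVED', with_reservation=True))  # True: proposed behaviour
```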

lagerhardt commented 2 months ago

Yes, these are also reserved. We typically use a full system reservation after a maintenance to do checkout.

vkarak commented 2 months ago

I think we need to add better support for RESERVED nodes then. Currently, you could run with `--flex-alloc-policy=IDLE+RESERVED`, but you would have to submit the tests serially as before.