So this post to the SLURM list describes the same problem: they can't differentiate between memory per node and memory per CPU. As far as I can see there are no replies :-( The only option I can see is to run scontrol on each job and parse its output to get the memory usage, which isn't ideal.
I've now added an additional call to scontrol so we can detect the two different types of memory allocation. I'm not sure whether this will slow the system down if we have very large numbers of jobs running, but there doesn't seem to be an obvious alternative.
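For reference, here is a minimal sketch of what that extra lookup can do. This is an illustration of the approach rather than the actual monitoring code; the MinMemoryCPU, MinMemoryNode and NumCPUs field names are as they appear in `scontrol show job` output, but the parsing details below are assumptions.

```python
import re
import subprocess

def parse_mem_mb(value):
    """Convert a SLURM memory string such as '31G' or '512M' into megabytes."""
    number, unit = re.match(r"(\d+)([KMGT]?)", value).groups()
    factor = {"K": 1 / 1024, "": 1, "M": 1, "G": 1024, "T": 1024 * 1024}[unit]
    return int(number) * factor

def job_memory_mb(jobid):
    """Return the total memory (MB) actually allocated to a running job."""
    out = subprocess.run(
        ["scontrol", "-o", "show", "job", str(jobid)],
        capture_output=True, text=True, check=True,
    ).stdout
    # -o gives a single line of space-separated key=value fields
    # (naive parse: assumes values contain no spaces)
    fields = dict(f.split("=", 1) for f in out.split() if "=" in f)

    cpus = int(fields.get("NumCPUs", "1"))
    if "MinMemoryCPU" in fields:
        # Memory was requested per CPU, so scale by the allocated CPU count
        return parse_mem_mb(fields["MinMemoryCPU"]) * cpus
    # Otherwise the request was per node
    return parse_mem_mb(fields.get("MinMemoryNode", "0"))
```

For a job like 4226 below this would report 2 × 31G = 62G rather than the 31G that squeue shows.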
We've had some jobs submitted which seem to be requesting memory per CPU rather than per node, and which are therefore being allocated double the memory we expect. The monitoring can't track this, so we're under-reporting the actual usage.
For example:
This shows job 4226 using 31G and 2 CPUs, however:
We can see that we're actually using 62G, not 31G.
The shift from 1 CPU to 2 CPUs is because of hyperthreading: allocation is done in physical cores rather than threads. The memory over-allocation will be because the memory was requested per CPU:
MinMemoryCPU=31G
With 2 allocated CPUs that gives 2 × 31G = 62G, which matches the usage we see above.
On a different job we get:
So here we correctly get memory per node, not per CPU. I can't see how to expose this distinction in squeue so that we can monitor it.
At the same time we should also note that unallocated jobs show as requesting 1 CPU, even though we know they're actually going to round up to the next even number because of hyperthreading.
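As a small illustration of that rounding (an assumption about how we might model it in the monitoring, not existing code): with hyperthreading, SLURM hands out whole physical cores, so an odd thread request is effectively rounded up to the next even number.

```python
def effective_cpus(requested_cpus: int) -> int:
    """Round a thread request up to a whole number of 2-thread physical cores."""
    return requested_cpus + (requested_cpus % 2)

# A pending job that asks for 1 CPU will actually be allocated 2 threads
# (one physical core); 3 becomes 4, and even requests are unchanged.
assert effective_cpus(1) == 2
assert effective_cpus(3) == 4
assert effective_cpus(4) == 4
```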