s-andrews / capstone_monitor

A system to monitor activity on the capstone cluster
GNU General Public License v3.0

Some jobs under-reporting memory usage in monitoring #15

Closed s-andrews closed 5 months ago

s-andrews commented 5 months ago

We've had some jobs submitted which request memory per CPU rather than per node, and they are being allocated double the memory we record for them. The monitoring can't track this, so we're under-reporting the actual usage.

For example:

$ squeue -r -O jobid,username,minmemory,numcpus,nodelist
JOBID               USER                MIN_MEMORY          CPUS                NODELIST            
4226                khans               31G                 2                   compute-0-8         
4225                khans               31G                 2                   compute-0-8         
4224                khans               31G                 2                   compute-0-8         
4223                khans               31G                 2                   compute-0-8         
4222                khans               31G                 2                   compute-0-8         
4221                khans               31G                 2                   compute-0-8         
4220                khans               31G                 2                   compute-0-8         
4219                khans               31G                 2                   compute-0-8         

This shows job 4226 using 31G and 2 CPUs. However:

$ scontrol show jobid -d 4226
JobId=4226 ArrayJobId=3777 ArrayTaskId=70 JobName=pseudobulk_groupsize4_cap.sh
   UserId=khans(14334) GroupId=kelsey(15003) MCS_label=N/A
   Priority=1 Nice=0 Account=khans QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:50 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2024-06-02T19:23:26 EligibleTime=2024-06-02T19:23:26
   AccrueTime=2024-06-02T19:23:26
   StartTime=2024-06-03T12:02:03 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-06-03T12:02:03 Scheduler=Main
   Partition=normal AllocNode:Sid=capstone:1044005
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-0-8
   BatchHost=compute-0-8
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=31G,node=1,billing=1
   AllocTRES=cpu=2,mem=62G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=(null)
     Nodes=compute-0-8 CPU_IDs=0-1 Mem=63488 GRES=
   MinCPUsNode=1 MinMemoryCPU=31G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bi/home/khans/Gavin_Kelsey/NSN_SN/Methylation_Analysis_SK/Regression_analysis/pseudobulk_groupsize4_cap.sh
   WorkDir=/bi/home/khans/Gavin_Kelsey/NSN_SN/Methylation_Analysis_SK/Regression_analysis
   StdErr=/bi/home/khans/Gavin_Kelsey/NSN_SN/Methylation_Analysis_SK/Regression_analysis/slurm-3777_70.out
   StdIn=/dev/null
   StdOut=/bi/home/khans/Gavin_Kelsey/NSN_SN/Methylation_Analysis_SK/Regression_analysis/slurm-3777_70.out
   Power=

We can see from the TRES lines that the job is actually allocated 62G, not 31G:

ReqTRES=cpu=1,mem=31G,node=1,billing=1
AllocTRES=cpu=2,mem=62G,node=1,billing=2

The shift from 1 CPU to 2 CPUs is because of hyperthreading: allocation is done in physical cores, not threads. The memory over-allocation is because of MinMemoryCPU=31G, which means the 31G request is multiplied by the number of CPUs actually allocated.
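
Once you know MinMemoryCPU is in play the correction is just arithmetic: per-CPU memory times the CPUs actually granted. Something like this captures it (purely illustrative Python, not the monitoring code itself; parse_mem_gb is a hypothetical helper):

# Illustrative only: when MinMemoryCPU is set, the real allocation is the
# per-CPU memory multiplied by the CPUs actually granted.
def parse_mem_gb(value: str) -> float:
    # Convert a Slurm memory string such as '31G' or '63488M' to GB.
    units = {"K": 1 / 2**20, "M": 1 / 2**10, "G": 1.0, "T": 2**10}
    return float(value[:-1]) * units[value[-1]] if value[-1] in units else float(value) / 2**10

def real_allocation_gb(min_memory_cpu: str, allocated_cpus: int) -> float:
    return parse_mem_gb(min_memory_cpu) * allocated_cpus

# The job above: 31G per CPU with 2 CPUs allocated -> 62G, matching AllocTRES.
assert real_allocation_gb("31G", 2) == 62.0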

On a different job we get:

ReqTRES=cpu=5,mem=20G,node=1,billing=5
AllocTRES=cpu=6,mem=20G,node=1,billing=6
MinCPUsNode=5 MinMemoryNode=20G

So there we correctly get memory per node rather than per CPU. I can't see how to expose this distinction in squeue so that the monitoring can pick it up.

At the same time we should also note that unallocated (pending) jobs show as requesting 1 CPU even though we know the allocation will round up to the next even number because of hyperthreading.
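
For reference, the rounding we expect to see is just a round-up to a whole number of physical cores. A throwaway sketch, assuming 2 hardware threads per core on these nodes:

# Illustrative only: with 2 hardware threads per core, Slurm allocates whole
# physical cores, so a CPU request rounds up to the next even number.
def cpus_after_allocation(requested_cpus: int, threads_per_core: int = 2) -> int:
    cores = -(-requested_cpus // threads_per_core)   # ceiling division
    return cores * threads_per_core

assert cpus_after_allocation(1) == 2   # the 1 CPU -> 2 CPU case above
assert cpus_after_allocation(5) == 6   # the 5 CPU -> 6 CPU case above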

s-andrews commented 5 months ago

This post to the SLURM mailing list describes the same problem.

They can't differentiate between memory per node and memory per CPU either. As far as I can see there were no replies :-( The only option I can see is to run scontrol on each job and parse its output to get the real memory usage, which isn't ideal.
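
For what it's worth, the per-job parsing could look something like this (a rough sketch only, not the actual monitoring code; the regexes and function names are mine):

# A rough sketch of the "run scontrol per job and parse it" approach.
import re
import subprocess

def parse_mem_gb(value: str) -> float:
    # Same helper as in the earlier sketch: '31G' -> 31.0, '63488M' -> 62.0
    units = {"K": 1 / 2**20, "M": 1 / 2**10, "G": 1.0, "T": 2**10}
    return float(value[:-1]) * units[value[-1]] if value[-1] in units else float(value) / 2**10

def effective_memory_gb(jobid: str) -> float:
    # Run the same command as above and pull out the fields we need.
    out = subprocess.run(
        ["scontrol", "show", "jobid", "-d", jobid],
        capture_output=True, text=True, check=True,
    ).stdout

    num_cpus = int(re.search(r"NumCPUs=(\d+)", out).group(1))
    per_cpu = re.search(r"MinMemoryCPU=(\S+)", out)
    per_node = re.search(r"MinMemoryNode=(\S+)", out)

    if per_cpu:
        # Per-CPU request: scale by the CPUs actually allocated.
        return parse_mem_gb(per_cpu.group(1)) * num_cpus
    # Per-node request: use the value as-is.
    return parse_mem_gb(per_node.group(1))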

s-andrews commented 5 months ago

I've now added an additional scontrol call per job so we can detect the two different types of memory allocation. I'm not sure whether this will slow the system down when very large numbers of jobs are running, but there doesn't seem to be an obvious alternative.
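
If the extra calls ever do become a bottleneck, one option worth trying (untested here, so treat it as an assumption) is that scontrol can dump every job in a single call with one record per line, which could then be parsed in one pass:

# Hedged sketch: 'scontrol -o show job' prints one record per line for every
# job, so a single call can replace one scontrol invocation per running job.
import subprocess

def all_job_records() -> list[str]:
    out = subprocess.run(
        ["scontrol", "-o", "show", "job"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each non-empty line is one complete job record,
    # e.g. "JobId=4226 ... NumCPUs=2 ... MinMemoryCPU=31G ..."
    return [line for line in out.splitlines() if line.strip()]

That would keep it to one fork of scontrol per monitoring cycle rather than one per job.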