psnc-qcg / QCG-PilotJob

The QCG Pilot Job service for execution of many computing tasks inside one allocation
Apache License 2.0

SlurmEnvError: failed to parse cpu binding: the node binding for node... #140

Open joconnor22 opened 3 years ago

joconnor22 commented 3 years ago

Hi, thanks a lot for the package!

I'm working with QCG-PilotJob via the EasyVVUQ interface and I'm having trouble getting it working on Cirrus. Each Cirrus node contains two 18-core processors (two NUMA regions), and each core supports two hardware threads.

As an MWE, the following:

# test.py
from qcg.pilotjob.api.manager import LocalManager
manager = LocalManager()
print('available resources: ', manager.resources())

gives the following exception:

qcg.pilotjob.errors.SlurmEnvError: failed to parse cpu binding: the node binding for node (r1i7n29) (72) differs from cores per node 36
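
The two numbers in the message line up with the node layout described above: the node binding covers every hardware thread, while the cores-per-node figure counts only physical cores. A quick illustration of the arithmetic (assuming the two-socket, 18-core, 2-thread layout mentioned earlier):

# illustration.py -- not part of QCG-PilotJob, just the arithmetic behind the error
sockets = 2
cores_per_socket = 18
threads_per_core = 2

physical_cores = sockets * cores_per_socket             # 36, the "cores per node" in the error
hardware_threads = physical_cores * threads_per_core    # 72, the node binding figure in the error

print(physical_cores, hardware_threads)                 # 36 72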

A simplified version of the SLURM submission script I'm using looks like this:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --tasks-per-node=36
#SBATCH --cpus-per-task=1

pipenv run python test.py

I'm guessing the exception is raised because the hyper-threading doesn't match the SLURM resources, but I can't figure out whether there's a way to ignore the hardware threads and consider only the physical cores. Is this possible?

pkopta commented 3 years ago

Hi. Yes, nodes with hyper-threading enabled can be difficult for the QCG-PilotJob manager to interpret; we have already struggled on some systems to handle such configurations correctly. In general, information about available resources is collected from two sources: the SLURM environment variables set inside the allocation and the detailed job description reported by SLURM itself. Could you please send the output of scontrol show -o --detail job $SLURM_JOBID and env | grep SLURM from inside your allocation?
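
For illustration, a minimal sketch (not QCG-PilotJob's internal code) of reading those two kinds of information from inside an allocation, using the same commands requested above:

# sketch.py -- hypothetical helper, assumes it runs inside a SLURM allocation
import os
import subprocess

# Source 1: SLURM environment variables set in the allocation
nodelist = os.environ.get("SLURM_NODELIST")
cpus_on_node = os.environ.get("SLURM_CPUS_ON_NODE")
tasks_per_node = os.environ.get("SLURM_TASKS_PER_NODE")
print("env:", nodelist, cpus_on_node, tasks_per_node)

# Source 2: the detailed job description reported by SLURM itself
job_id = os.environ["SLURM_JOB_ID"]
detail = subprocess.run(
    ["scontrol", "show", "-o", "--detail", "job", job_id],
    capture_output=True, text=True, check=True,
).stdout
print("scontrol:", detail)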

joconnor22 commented 3 years ago

Hi, thanks for your reply.

Here is the relevant output from scontrol show -o --detail job $SLURM_JOBID:

...
AllocNode:Sid=cirrus-services1:60172
ReqNodeList=(null)
ExcNodeList=(null)
NodeList=r1i1n35
BatchHost=r1i1n35
NumNodes=1
NumCPUs=36
NumTasks=36
CPUs/Task=1
ReqB:S:C:T=0:0:*:*
TRES=cpu=36,mem=257508M,node=1,billing=36
Socks/Node=*
NtasksPerN:B:S:C=36:0:*:*
CoreSpec=*
JOB_GRES=(null)
Nodes=r1i1n35
CPU_IDs=0-71
Mem=257508
GRES=
MinCPUsNode=36
MinMemoryCPU=7153M
MinTmpDiskNode=0
...

and from env | grep SLURM:

...
SLURM_MEM_PER_CPU=7153
SLURM_NODEID=0
SLURM_TASK_PID=53560
SLURM_PRIO_PROCESS=0
SLURM_CPUS_PER_TASK=1
SLURM_PROCID=0
SLURMD_NODENAME=r1i1n35
SLURM_TASKS_PER_NODE=36
SLURM_NNODES=1
SLURM_GET_USER_ENV=1
SLURM_NTASKS_PER_NODE=36
SLURM_JOB_NODELIST=r1i1n35
SLURM_NODELIST=r1i1n35
SLURM_NTASKS=36
SLURM_JOB_CPUS_PER_NODE=36
SLURM_TOPOLOGY_ADDR=wholesystem.vr1i0s0.vr1i1s0,vr1i0s2.vr1i1s2...r1i1s2.r1i1n35
SLURM_WORKING_CLUSTER=cirrus:cirrus-services1:6817:8960:101
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.switch.switch.switch.switch.node
SLURM_CPUS_ON_NODE=36
SLURM_JOB_NUM_NODES=1
SLURM_NPROCS=36
SLURM_EXPORT_ENV=all
SLURM_GTIDS=0
...

I removed some of the output for brevity, but if I've missed something, just let me know and I'll send the full output.
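
The mismatch is already visible above: scontrol lists CPU_IDs=0-71 (72 hardware threads) for the node, while NumCPUs and SLURM_CPUS_ON_NODE report 36. A small sketch of counting the ids in a scontrol-style range list (hypothetical helper, not QCG-PilotJob code):

# count CPU ids in a scontrol-style list such as "0-71" or "0-17,36-53"
def count_cpu_ids(spec):
    total = 0
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        else:
            total += 1
    return total

bound = count_cpu_ids("0-71")   # CPU_IDs from the scontrol output -> 72
reported = 36                   # NumCPUs / SLURM_CPUS_ON_NODE from the same allocation
print(bound, reported)          # 72 36, the same pair as in the exception message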

pkopta commented 3 years ago

Thank you for the info. Could you please provide one more piece of information? I'm interested in the output of the lscpu -p command launched in your SLURM allocation.

joconnor22 commented 3 years ago

Sure, no problem. Here is the output from lscpu -p:

# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,6,0,0,,6,6,6,0
7,7,0,0,,7,7,7,0
8,8,0,0,,8,8,8,0
9,9,0,0,,9,9,9,0
10,10,0,0,,10,10,10,0
11,11,0,0,,11,11,11,0
12,12,0,0,,12,12,12,0
13,13,0,0,,13,13,13,0
14,14,0,0,,14,14,14,0
15,15,0,0,,15,15,15,0
16,16,0,0,,16,16,16,0
17,17,0,0,,17,17,17,0
18,18,1,1,,18,18,18,1
19,19,1,1,,19,19,19,1
20,20,1,1,,20,20,20,1
21,21,1,1,,21,21,21,1
22,22,1,1,,22,22,22,1
23,23,1,1,,23,23,23,1
24,24,1,1,,24,24,24,1
25,25,1,1,,25,25,25,1
26,26,1,1,,26,26,26,1
27,27,1,1,,27,27,27,1
28,28,1,1,,28,28,28,1
29,29,1,1,,29,29,29,1
30,30,1,1,,30,30,30,1
31,31,1,1,,31,31,31,1
32,32,1,1,,32,32,32,1
33,33,1,1,,33,33,33,1
34,34,1,1,,34,34,34,1
35,35,1,1,,35,35,35,1
36,0,0,0,,0,0,0,0
37,1,0,0,,1,1,1,0
38,2,0,0,,2,2,2,0
39,3,0,0,,3,3,3,0
40,4,0,0,,4,4,4,0
41,5,0,0,,5,5,5,0
42,6,0,0,,6,6,6,0
43,7,0,0,,7,7,7,0
44,8,0,0,,8,8,8,0
45,9,0,0,,9,9,9,0
46,10,0,0,,10,10,10,0
47,11,0,0,,11,11,11,0
48,12,0,0,,12,12,12,0
49,13,0,0,,13,13,13,0
50,14,0,0,,14,14,14,0
51,15,0,0,,15,15,15,0
52,16,0,0,,16,16,16,0
53,17,0,0,,17,17,17,0
54,18,1,1,,18,18,18,1
55,19,1,1,,19,19,19,1
56,20,1,1,,20,20,20,1
57,21,1,1,,21,21,21,1
58,22,1,1,,22,22,22,1
59,23,1,1,,23,23,23,1
60,24,1,1,,24,24,24,1
61,25,1,1,,25,25,25,1
62,26,1,1,,26,26,26,1
63,27,1,1,,27,27,27,1
64,28,1,1,,28,28,28,1
65,29,1,1,,29,29,29,1
66,30,1,1,,30,30,30,1
67,31,1,1,,31,31,31,1
68,32,1,1,,32,32,32,1
69,33,1,1,,33,33,33,1
70,34,1,1,,34,34,34,1
71,35,1,1,,35,35,35,1
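
The listing confirms the picture: logical CPUs 0-35 and 36-71 map onto the same 36 physical cores, so every core appears twice because of hyper-threading. A minimal sketch of deriving both counts from lscpu -p output (illustration only, not how QCG-PilotJob does it):

# count logical CPUs and physical cores from `lscpu -p`
import subprocess

out = subprocess.run(["lscpu", "-p"], capture_output=True, text=True, check=True).stdout

logical_cpus = set()
physical_cores = set()
for line in out.splitlines():
    if not line or line.startswith("#"):
        continue                              # skip blanks and the header comments
    cpu, core, socket = line.split(",")[:3]
    logical_cpus.add(int(cpu))
    physical_cores.add((int(socket), int(core)))

print(len(logical_cpus), len(physical_cores))  # 72 36 on a Cirrus node
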
pkopta commented 3 years ago

Hi, I just merged a bunch of patches to the develop branch of the repository. Are you able to install the QCG-PilotJob manager from this branch and check whether the problem with HT on Cirrus is solved?

joconnor22 commented 3 years ago

Hi, the MWE from above now runs on Cirrus without raising any exceptions and the output I get back is:

available resources:  {'total_nodes': 1, 'total_cores': 36, 'used_cores': 0, 'free_cores': 36}

So it looks to me like it's now working as expected. Thanks very much!

I also tried re-running a test application using the EasyVVUQ interface with the develop branch of QCG-PilotJob on my local machine. Previously it ran fine with no issues. However, with the QCG-PilotJob updates I'm getting a runtime warning on exit at publisher.py:160:

RuntimeWarning: coroutine 'wait_for' was never awaited
  asyncio.wait_for(self.publisher_task, 5)
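
That warning usually means asyncio.wait_for(...) was called without being awaited; the call only creates a coroutine object, it does not wait by itself. A minimal sketch of the difference (hypothetical names, not the actual publisher code):

import asyncio

async def stop_publisher(publisher_task):
    # Calling asyncio.wait_for without 'await' creates a coroutine that is never
    # scheduled, which is what raises "coroutine 'wait_for' was never awaited":
    # asyncio.wait_for(publisher_task, 5)
    #
    # Awaiting it actually waits for the task to finish, with a 5 second timeout:
    await asyncio.wait_for(publisher_task, 5)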

Obviously EasyVVUQ hasn't been updated for the latest changes in QCG-PilotJob, so that could be causing the issue, but I wanted to check whether you think this should be raised on their end or here before I close the issue.

pkopta commented 3 years ago

Yes, you are right. My latest updates, especially the notification mechanism, could break the integration with EasyVVUQ. @bartoszbosak and I will try to fix it soon. Thanks for pointing that out.

joconnor22 commented 3 years ago

Great, thanks very much for your help with everything. Do you want me to keep this issue open in the meantime?

pkopta commented 3 years ago

Let's leave this ticket open until we resolve the integration issue with EasyVVUQ.