joconnor22 opened this issue 3 years ago
Hi. Yes, nodes with hyper-threading enabled can be difficult for the QCG-PilotJob manager to interpret; we have already struggled to handle such configurations correctly on some systems. In general, information on the available resources is collected from two sources:
scontrol show -o --detail job $SLURM_JOBID
env
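For context, this means the manager combines the Slurm environment with the scontrol job record, roughly along the lines of the sketch below (a simplified illustration only, not the actual QCG-PilotJob code):

import os
import subprocess

def slurm_allocation_info():
    """Combine the environment view of the allocation with the scontrol job record."""
    job_id = os.environ['SLURM_JOB_ID']
    env_view = {
        'nodes': int(os.environ.get('SLURM_NNODES', '1')),
        'tasks': int(os.environ.get('SLURM_NTASKS', '1')),
        'cpus_on_node': int(os.environ.get('SLURM_CPUS_ON_NODE', '1')),
    }
    # The per-node CPU ids (the CPU_IDs field) only appear in the scontrol record.
    scontrol_record = subprocess.run(
        ['scontrol', 'show', '-o', '--detail', 'job', job_id],
        capture_output=True, text=True, check=True).stdout
    return env_view, scontrol_record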
Could you please send me the results of these commands run in the Slurm task?
Hi, thanks for your reply.
Here is the relevant output from scontrol show -o --detail job $SLURM_JOBID:
...
AllocNode:Sid=cirrus-services1:60172
ReqNodeList=(null)
ExcNodeList=(null)
NodeList=r1i1n35
BatchHost=r1i1n35
NumNodes=1
NumCPUs=36
NumTasks=36
CPUs/Task=1
ReqB:S:C:T=0:0:*:*
TRES=cpu=36,mem=257508M,node=1,billing=36
Socks/Node=*
NtasksPerN:B:S:C=36:0:*:*
CoreSpec=*
JOB_GRES=(null)
Nodes=r1i1n35
CPU_IDs=0-71
Mem=257508
GRES=
MinCPUsNode=36
MinMemoryCPU=7153M
MinTmpDiskNode=0
...
and from env | grep SLURM:
...
SLURM_MEM_PER_CPU=7153
SLURM_NODEID=0
SLURM_TASK_PID=53560
SLURM_PRIO_PROCESS=0
SLURM_CPUS_PER_TASK=1
SLURM_PROCID=0
SLURMD_NODENAME=r1i1n35
SLURM_TASKS_PER_NODE=36
SLURM_NNODES=1
SLURM_GET_USER_ENV=1
SLURM_NTASKS_PER_NODE=36
SLURM_JOB_NODELIST=r1i1n35
SLURM_NODELIST=r1i1n35
SLURM_NTASKS=36
SLURM_JOB_CPUS_PER_NODE=36
SLURM_TOPOLOGY_ADDR=wholesystem.vr1i0s0.vr1i1s0,vr1i0s2.vr1i1s2...r1i1s2.r1i1n35
SLURM_WORKING_CLUSTER=cirrus:cirrus-services1:6817:8960:101
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.switch.switch.switch.switch.node
SLURM_CPUS_ON_NODE=36
SLURM_JOB_NUM_NODES=1
SLURM_NPROCS=36
SLURM_EXPORT_ENV=all
SLURM_GTIDS=0
...
I removed some of the output for brevity but if I've missed something then just let me know and I'll send the full output.
Thank you for the info. Could you please provide one more piece of information? I'm interested in the output of the lscpu -p command run in your Slurm allocation.
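The CPU-to-core mapping in that output is what lets us tell hardware threads apart from physical cores, roughly as in this sketch (for illustration only, not the manager's actual parsing code):

import subprocess

def count_cpus():
    # Parse the parsable `lscpu -p` format: CPU,Core,Socket,Node,...
    out = subprocess.run(['lscpu', '-p'], capture_output=True, text=True,
                         check=True).stdout
    logical, physical = set(), set()
    for line in out.splitlines():
        if not line or line.startswith('#'):
            continue  # skip the comment header
        cpu, core, socket = line.split(',')[:3]
        logical.add(cpu)              # one entry per hardware thread
        physical.add((socket, core))  # one entry per physical core
    return len(logical), len(physical)

With HT enabled the two counts differ by a factor of two, and that is the ambiguity the manager has to resolve against the Slurm task count.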
Sure, no problem. Here is the output from lscpu -p:
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,6,0,0,,6,6,6,0
7,7,0,0,,7,7,7,0
8,8,0,0,,8,8,8,0
9,9,0,0,,9,9,9,0
10,10,0,0,,10,10,10,0
11,11,0,0,,11,11,11,0
12,12,0,0,,12,12,12,0
13,13,0,0,,13,13,13,0
14,14,0,0,,14,14,14,0
15,15,0,0,,15,15,15,0
16,16,0,0,,16,16,16,0
17,17,0,0,,17,17,17,0
18,18,1,1,,18,18,18,1
19,19,1,1,,19,19,19,1
20,20,1,1,,20,20,20,1
21,21,1,1,,21,21,21,1
22,22,1,1,,22,22,22,1
23,23,1,1,,23,23,23,1
24,24,1,1,,24,24,24,1
25,25,1,1,,25,25,25,1
26,26,1,1,,26,26,26,1
27,27,1,1,,27,27,27,1
28,28,1,1,,28,28,28,1
29,29,1,1,,29,29,29,1
30,30,1,1,,30,30,30,1
31,31,1,1,,31,31,31,1
32,32,1,1,,32,32,32,1
33,33,1,1,,33,33,33,1
34,34,1,1,,34,34,34,1
35,35,1,1,,35,35,35,1
36,0,0,0,,0,0,0,0
37,1,0,0,,1,1,1,0
38,2,0,0,,2,2,2,0
39,3,0,0,,3,3,3,0
40,4,0,0,,4,4,4,0
41,5,0,0,,5,5,5,0
42,6,0,0,,6,6,6,0
43,7,0,0,,7,7,7,0
44,8,0,0,,8,8,8,0
45,9,0,0,,9,9,9,0
46,10,0,0,,10,10,10,0
47,11,0,0,,11,11,11,0
48,12,0,0,,12,12,12,0
49,13,0,0,,13,13,13,0
50,14,0,0,,14,14,14,0
51,15,0,0,,15,15,15,0
52,16,0,0,,16,16,16,0
53,17,0,0,,17,17,17,0
54,18,1,1,,18,18,18,1
55,19,1,1,,19,19,19,1
56,20,1,1,,20,20,20,1
57,21,1,1,,21,21,21,1
58,22,1,1,,22,22,22,1
59,23,1,1,,23,23,23,1
60,24,1,1,,24,24,24,1
61,25,1,1,,25,25,25,1
62,26,1,1,,26,26,26,1
63,27,1,1,,27,27,27,1
64,28,1,1,,28,28,28,1
65,29,1,1,,29,29,29,1
66,30,1,1,,30,30,30,1
67,31,1,1,,31,31,31,1
68,32,1,1,,32,32,32,1
69,33,1,1,,33,33,33,1
70,34,1,1,,34,34,34,1
71,35,1,1,,35,35,35,1
Hi, I just merged a bunch of patches into the develop branch of the repository. Are you able to install the QCG-PilotJob manager from this branch and check whether the problem with HT on Cirrus is solved?
Hi, the MWE from above now runs on Cirrus without raising any exceptions and the output I get back is:
available resources: {'total_nodes': 1, 'total_cores': 36, 'used_cores': 0, 'free_cores': 36}
So it looks to me like it's now working as expected. Thanks very much!
I also tried re-running a test application using the EasyVVUQ interface with the develop branch of QCG-PilotJob on my local machine. Previously it ran fine with no issues. However, with the QCG-PilotJob updates I'm getting a runtime warning on exit from publisher.py:160:
RuntimeWarning: coroutine 'wait_for' was never awaited
asyncio.wait_for(self.publisher_task, 5)
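For reference, this warning means that the coroutine object returned by asyncio.wait_for() was created but never awaited or scheduled. A minimal reproduction independent of QCG-PilotJob (the names below are only for the demo):

import asyncio

async def publisher_task():
    await asyncio.sleep(0.1)

async def main():
    task = asyncio.ensure_future(publisher_task())
    asyncio.wait_for(task, 5)          # builds a coroutine but never runs it -> RuntimeWarning
    # await asyncio.wait_for(task, 5)  # awaiting it would silence the warning
    await task                         # avoid leaving the demo task pending

asyncio.run(main())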
Obviously EasyVVUQ hasn't been updated for the latest changes in QCG-PilotJob, so that could be causing the issue, but I just wanted to check whether you think this is something that should be raised on their end or here before I close the issue.
Yes, you are right. My latest updates, especially the notification mechanism, could have broken the integration with EasyVVUQ. @bartoszbosak and I will try to fix it soon. Thanks for pointing that out.
Great, thanks very much for your help with everything. Do you want me to keep this issue open in the meantime?
Let's leave this ticket open until we resolve the integration issue with EasyVVUQ.
Hi, thanks a lot for the package!
I'm working with QCG-PilotJob via the EasyVVUQ interface and am having an issue getting it to work on Cirrus. Each Cirrus node contains two 18-core processors (one per NUMA region), and each core supports two hardware threads.
As an MWE, the following:
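Roughly, a minimal script of this kind (a sketch only; it assumes the qcg.pilotjob.api.manager.LocalManager interface, and is not the exact snippet):

# Start the QCG-PilotJob manager inside the Slurm allocation and print
# the resources it detects (sketch of the assumed form, not the exact MWE).
from qcg.pilotjob.api.manager import LocalManager

manager = LocalManager()
print('available resources:', manager.resources())
manager.finish()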
gives the following exception (from here):
A simplified version of the SLURM submission script I'm using looks like this:
I'm guessing the exception is raised because the hyper-threaded CPU layout doesn't match the resources SLURM reports, but I can't figure out whether there's a way to ignore the hardware threads and just consider the physical cores. Is this possible?