etiennesky closed this issue 6 years ago
Here is the output of qstat:
```
ccapar:
                                                            Req'd  Req'd  Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7794282.ccapar  c3et     nf       hc_a13s_18    --   25  24   24gb 27:00 Q  --
```
This information is erroneous as far as NDS goes, since this is an OpenMP-like job (we request threads, not tasks). NDS should report 2: one node for our 12 physical cores, one for accounting. The latter is not billed, just needed by ECMWF. Look at the epilogue at the end of the job log:
```
INFO
INFO MOM RESOURCES USED   |  ncpus   cput   runtime   vmem       mem
INFO ----------------------------------------------------------------------------
INFO 23.05.2018 - 16:11   |  24      315    360       274876kb   122296kb
INFO
```
The vmem and mem figures are not valid for parallel jobs. The ncpus value (here 24) is computed by PBS as if hyperthreading were in use (i.e. EC_hyperthreads=2, while we use 1). In other words, for accounting purposes it counts logical cores. So 12 physical cores × 2 = 24 logical cores, exactly what we expect.
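The accounting above boils down to a one-line calculation (variable names are mine, not from PBS):

```python
# PBS bills logical cores: physical cores × the hyperthreading factor
# it assumes for accounting (EC_hyperthreads=2), regardless of the
# EC_hyperthreads=1 we actually request.
physical_cores = 12
pbs_hyperthreads = 2
billed_ncpus = physical_cores * pbs_hyperthreads
print(billed_ncpus)  # 24, matching the ncpus in the epilogue
```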
The bottom line is that when requesting the nf queue, the NDS field is not valid. I did a quick test requesting either (12 tasks, 1 thread per task) or (1 task, 12 threads per task) in that fractional queue: in both cases one node and 12 physical cores are used, 24 logical cores are billed, and qstat reports 25 NDS.
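For reference, a sketch of the two equivalent resource requests from the test above, using the ECMWF-specific EC_* PBS directives (directive names assumed from the cca/ccb setup; this is an illustration, not the actual template script):

```shell
#!/bin/bash
# Variant A: MPI-style, 12 tasks with 1 thread each
#PBS -q nf
#PBS -l EC_total_tasks=12
#PBS -l EC_threads_per_task=1
#PBS -l EC_hyperthreads=1

# Variant B: OpenMP-style, 1 task with 12 threads
##PBS -q nf
##PBS -l EC_total_tasks=1
##PBS -l EC_threads_per_task=12
##PBS -l EC_hyperthreads=1

# Either way: one node, 12 physical cores used,
# 24 logical cores billed, and qstat reports 25 NDS.
```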
While looking into this, I cleaned up the template job script a bit. It does not change anything performance-wise. See 2c7011e.
Thanks, I had not dug too deep into the actual accounting that was used.
When I submit a hiresclim2 job with scripts/hc.sh, the request is for 24 nodes, which is a waste of resources. It should run on 1 node with 12 cores only.