wrong queue parameters on CCA

etiennesky commented 6 years ago

When I submit a hiresclim2 job with scripts/hc.sh, the request is for 24 nodes, which is a waste of resources. It should run on 1 node with only 12 cores.

etiennesky commented 6 years ago

Here is the output of qstat:

ccapar: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7794282.ccapar  c3et     nf       hc_a13s_18    --   25  24   24gb 27:00 Q   --

plesager commented 6 years ago

The NDS value reported here is wrong, because this is an OpenMP-like job (we request threads, not tasks). It should report 2 for NDS: one for our 12 physical cores and one for accounting. The latter is not billed; it is just needed by ECMWF. If you look at the epilogue at the end of the job log:

INFO
INFO MOM RESOURCES USED | ncpus cput runtime vmem mem
INFO ----------------------------------------------------------------------------
INFO 23.05.2018 - 16:11 | 24 315 360 274876kb 122296kb
INFO

The vmem and mem information is not valid for parallel jobs. The ncpus value (here 24) is what PBS counts as if you had been using hyperthreading (i.e. EC_hyperthreads=2, while we use 1). In other words, for accounting purposes it counts the number of logical cores. So 12*2=24, which is exactly what we expect.
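
To make the arithmetic above concrete, here is a minimal Python sketch. It is only an illustration of the counting described in this comment, not ECMWF's or PBS's actual accounting code; the function name and its parameters are hypothetical, and the factor of 2 is the EC_hyperthreads=2 assumption mentioned above.

```python
def billed_logical_cores(physical_cores, accounted_hyperthreads=2):
    """Return the number of logical cores counted for accounting.

    Assumption (from the comment above): accounting always counts as if
    hyperthreading were on with 2 hardware threads per physical core,
    regardless of the EC_hyperthreads value the job actually uses.
    """
    if physical_cores < 1 or accounted_hyperthreads < 1:
        raise ValueError("core and hyperthread counts must be positive")
    return physical_cores * accounted_hyperthreads


# The hiresclim2 job uses 12 physical cores on one node.
ncpus = billed_logical_cores(12)
print(ncpus)
```

With 12 physical cores this gives the same ncpus figure as the epilogue line quoted above.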

The bottom line is that when requesting the nf queue, the NDS value is not valid. I did a quick test requesting either (12 tasks, 1 thread per task) or (1 task, 12 threads per task) in that fractional queue: in both cases, one node and 12 physical cores are used, 24 logical cores are billed, and qstat reports 25 NDS.
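
For anyone who wants to check this from a script, the following Python sketch splits the qstat row pasted earlier in this thread into named fields and flags NDS as unreliable for the nf queue. The field names and both helper functions are hypothetical, chosen to match the column headers in that output; the rule "NDS is not meaningful on nf" is simply the conclusion of this comment, not something qstat reports itself.

```python
# Column names follow the qstat header shown earlier in the thread.
QSTAT_FIELDS = [
    "job_id", "username", "queue", "jobname", "sessid",
    "nds", "tsk", "memory", "req_time", "state", "elapsed",
]


def parse_qstat_row(row):
    """Split one whitespace-separated qstat row into a dict of strings."""
    values = row.split()
    if len(values) != len(QSTAT_FIELDS):
        raise ValueError(
            f"expected {len(QSTAT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(QSTAT_FIELDS, values))


def nds_is_reliable(job):
    """Assumption from this thread: NDS is not valid in the fractional nf queue."""
    return job["queue"] != "nf"


row = "7794282.ccapar  c3et     nf       hc_a13s_18    --   25  24   24gb 27:00 Q   --"
job = parse_qstat_row(row)
print(job["queue"], job["nds"], job["tsk"], nds_is_reliable(job))
```

This assumes the job name contains no spaces, which holds for the row shown here.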

plesager commented 6 years ago

While looking into this, I cleaned up the template job script a bit. It does not change anything performance-wise. See 2c7011e.

etiennesky commented 6 years ago

Thanks, I had not dug very deep into the actual accounting that was used.

plesager / ece3-postproc

wrong queue parameters on CCA #30