payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

default nodesize #217

Open navidcy opened 4 years ago

navidcy commented 4 years ago

On raijin the default node size was, I believe, 16, and when we wanted to use the normalbw queue we had to add the following to config.yaml:

platform:
    nodesize: 32

Now with gadi, should we change the default nodesize to 48?

marshallward commented 4 years ago

Maybe a better idea is to run pbsnodes and parse its output to get this number?

aidanheerdegen commented 4 years ago

Whilst that information is available using pbsnodes:

$ pbsnodes -F json gadi-cpu-clx-0470
{
    "timestamp":1579661565,
    "pbs_version":"19.2.4.20190830141245",
    "pbs_server":"gadi-pbs-01",
    "nodes":{
        "gadi-cpu-clx-0470":{
            "Mom":"gadi-cpu-clx-0470.gadi.nci.org.au",
            "ntype":"PBS",
            "state":"job-busy",
            "pcpus":96,
            "jobs":[
                "1079829.gadi-pbs"
            ],
            "resources_available":{
                "arch":"linux",
                "host":"gadi-cpu-clx-0470",
                "jobfs":"429496729600b",
                "mem":"213647360kb",
                "ncpus":48,
                "ngpus":0,
                "topology":"rack-23-ib2,rack-23,group-1,cpu-clx",
                "vmem":"213647360kb",
                "vnode":"gadi-cpu-clx-0470"
            },
            "resources_assigned":{
                "jobfs":"102400kb",
                "mem":"41943040kb",
                "ncpus":48
            },
            "comment":"offlined by hook 'begin_checknode' due to hook error",
            "resv_enable":"True",
            "sharing":"default_shared",
            "license":"l",
            "last_state_change_time":1579659651,
            "last_used_time":1579659544
        }
    }
}

I've had a look, but it isn't clear to me how we would know a priori which node to query for that information, i.e. we would need a mapping of queue name to nodes.
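For reference, here is a minimal sketch of the parsing step, assuming the JSON layout shown in the output above. The function names `ncpus_from_pbsnodes` and `query_node` are hypothetical and not part of payu. It does not solve the queue-to-node mapping problem; it only shows that the number is easy to extract once a node name is known. Note that it reads `resources_available.ncpus` (48 here) rather than `pcpus` (96 here), on the assumption that the former is what a job can actually request.

```python
import json
import subprocess


def ncpus_from_pbsnodes(pbsnodes_json):
    """Map each node name in `pbsnodes -F json` output to its available ncpus."""
    report = json.loads(pbsnodes_json)
    return {
        name: info["resources_available"]["ncpus"]
        for name, info in report.get("nodes", {}).items()
    }


def query_node(node_name):
    """Run pbsnodes for one node; only works on a host where PBS is installed."""
    result = subprocess.run(
        ["pbsnodes", "-F", "json", node_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return ncpus_from_pbsnodes(result.stdout)


# Trimmed copy of the output above, so the parser can be tried without PBS.
sample = """
{
    "nodes": {
        "gadi-cpu-clx-0470": {
            "pcpus": 96,
            "resources_available": {"ncpus": 48, "mem": "213647360kb"}
        }
    }
}
"""

print(ncpus_from_pbsnodes(sample))  # {'gadi-cpu-clx-0470': 48}
```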

aidanheerdegen commented 4 years ago

I'm wondering about a system config file with some of this information, something like:

gadi.nci.org.au:
    normal:
          ncpus: 48
          mem: 256GB
    normalbw:
          ncpus: 28
          mem: 128GB

that could live somewhere in the {{payu}} directory. Maybe a {{platform}} directory?
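A rough sketch of how such a table might be consulted, assuming it has already been loaded into a nested dict with the same shape as the YAML above. The names `PLATFORM_QUEUES` and `queue_resources` are hypothetical, and the numbers are copied from the example purely for illustration.

```python
# Hypothetical in-memory form of the proposed system config file.
# Values are the illustrative ones from the example, not verified figures.
PLATFORM_QUEUES = {
    "gadi.nci.org.au": {
        "normal": {"ncpus": 48, "mem": "256GB"},
        "normalbw": {"ncpus": 28, "mem": "128GB"},
    },
}


def queue_resources(hostname, queue, table=PLATFORM_QUEUES):
    """Return the resource entry for `queue` on the platform matching `hostname`.

    A hostname matches a platform key if it equals the key or ends with
    "." + key. Returns None when the platform or the queue is unknown.
    """
    for domain, queues in table.items():
        if hostname == domain or hostname.endswith("." + domain):
            return queues.get(queue)
    return None


print(queue_resources("gadi-login-01.gadi.nci.org.au", "normalbw"))
```

Matching on the domain suffix is just one possible choice; the login node name used here is made up for the example.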

navidcy commented 4 years ago

And when the user selects a queue the rest is determined automatically by payu? This sounds nice.

Include also express and expressbw in that case.

(Btw, doesn't gadi have 190GB of RAM per node on normal?)

aidanheerdegen commented 4 years ago

Yeah, the idea is that it would be automatic. I was just putting in a couple of examples to show the idea; we would have all the available queues in there if the idea were adopted. Similarly, the numbers were just for illustration.
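To make the "automatic" behaviour concrete, here is one possible precedence order, sketched under assumptions: an explicit `platform: nodesize` in config.yaml wins, then a per-queue table, then a fallback. The function `resolve_nodesize`, the `queue_table` argument, and the use of a top-level `queue` key are assumptions made for illustration; the fallback of 16 is simply the raijin value recalled at the top of this thread.

```python
def resolve_nodesize(config, queue_table, fallback=16):
    """Pick a node size for a run.

    Precedence (an assumption, not payu's documented behaviour):
    1. platform.nodesize set explicitly in the user's config
    2. ncpus for the selected queue in queue_table
    3. the fallback value
    """
    platform = config.get("platform") or {}
    if "nodesize" in platform:
        return platform["nodesize"]

    entry = queue_table.get(config.get("queue"))
    if entry is not None and "ncpus" in entry:
        return entry["ncpus"]

    return fallback


# Illustrative numbers only, as in the example table earlier in the thread.
queues = {"normal": {"ncpus": 48}, "normalbw": {"ncpus": 28}}

print(resolve_nodesize({"queue": "normal"}, queues))  # 48
print(resolve_nodesize({"queue": "normalbw", "platform": {"nodesize": 32}}, queues))  # 32
print(resolve_nodesize({"queue": "express"}, queues))  # 16
```

Keeping the explicit setting as the highest priority means existing config.yaml files with a `nodesize` entry would keep working unchanged.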