prominence-eosc / prominence

PROMINENCE server
Apache License 2.0

Allow users to probe available resources #155

Open fcasson opened 2 years ago

fcasson commented 2 years ago

Strawperson CLI syntax: prominence resources

Even better, for multi-node MPI jobs, allow procs-per-node and nodes to be set automatically based on the most available resources.

alahiff commented 2 years ago

> Even better, for multi-node MPI jobs, allow procs-per-node and nodes to be set automatically based on the most available resources.

Could you please clarify this? Do you want to be able to specify a total number of procs, and then it will automatically run across whatever number of nodes is required in order to give this number?

fcasson commented 2 years ago

Yes, exactly. That would be the most common use case (and is in line with the philosophy that physical hardware is not something the users need to think about)
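A minimal sketch of the behaviour being requested, assuming all nodes are identical and their core count is known. The function name `plan_layout` and its arguments are hypothetical and are not part of the PROMINENCE CLI or API.

```python
import math

def plan_layout(total_procs, cores_per_node):
    """Return (nodes, procs_per_node) that covers total_procs on identical nodes.

    Hypothetical helper: uses the fewest nodes possible, then spreads the
    processes as evenly as it can across them.
    """
    if total_procs < 1 or cores_per_node < 1:
        raise ValueError("total_procs and cores_per_node must be positive")
    nodes = math.ceil(total_procs / cores_per_node)
    procs_per_node = math.ceil(total_procs / nodes)
    return nodes, procs_per_node

print(plan_layout(64, 16))  # 64 procs fit exactly on 4 nodes of 16 cores
print(plan_layout(40, 16))  # 40 procs need 3 nodes, 14 slots on each
```

When the total does not divide evenly, the last node ends up with a few unused slots; whether that is acceptable depends on the MPI application.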

fcasson commented 2 years ago

I think it is useful both to be able to probe resources and to be able to ignore them! I opened #156 for the latter since these should be separate issues.

alahiff commented 2 years ago

I suppose having more information than just quantities of resources would also be useful. For example, information about the types of CPUs (AMD/Intel, what type of processor, ...) would help if you want to run optimised code. Along with this, it would be useful to be able to restrict jobs to specific processors (e.g. if you have code optimised for Intel with AVX512).

fcasson commented 2 years ago

In the first iteration I just need to know CPUs per node, memory per node, site, and the number of nodes of each type. This could be via the CLI or on the Grafana pages.

Just the same information we are already getting from you by email will allow us to target jobs while we wait for #156, which I imagine will take a bit longer.

alahiff commented 2 years ago

Will something in this form be ok to begin with?

[
   {
      "cpus":8,
      "memory":32,
      "site":"OpenStack-STFC",
      "nodes":1
   },
   {
      "cpus":60,
      "memory":244,
      "site":"OpenStack-STFC",
      "nodes":4
   },
   {
      "cpus":16,
      "memory":32,
      "site":"OpenStack-TUBITAK",
      "nodes":8
   },
   {
      "cpus":16,
      "memory":93,
      "site":"OpenStack-UNIV-LILLE",
      "nodes":2
   }
]

I will probably need to have 2 sections, one for existing resources and another for potential resources (i.e. those which could be generated dynamically if needed). I think the dynamic resources should be separate because it's not 100% certain it will be possible to get them, e.g. if you want a lot of CPUs in a single node and the cloud itself doesn't have enough free resources.

fcasson commented 2 years ago

Yes, that would be a good start. If that is the total resources allocated, the second step would be to see how many of each type are free (or maybe that is what you are already suggesting).

alahiff commented 2 years ago

So something like this perhaps?

[
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":0,
         "memory":0
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":60,
         "memory":244
      },
      "free":{
         "cpus":0,
         "memory":126
      },
      "site":"OpenStack-STFC"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":0,
         "memory":0
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":2,
         "memory":4
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":8,
         "memory":32
      },
      "free":{
         "cpus":8,
         "memory":32
      },
      "site":"OpenStack-STFC"
   },
   {
      "capacity":{
         "cpus":60,
         "memory":244
      },
      "free":{
         "cpus":0,
         "memory":126
      },
      "site":"OpenStack-STFC"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":0,
         "memory":0
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":60,
         "memory":244
      },
      "free":{
         "cpus":2,
         "memory":130
      },
      "site":"OpenStack-STFC"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":93
      },
      "free":{
         "cpus":2,
         "memory":66
      },
      "site":"OpenStack-UNIV-LILLE"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":2,
         "memory":4
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":2,
         "memory":4
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":60,
         "memory":244
      },
      "free":{
         "cpus":0,
         "memory":126
      },
      "site":"OpenStack-STFC"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":0,
         "memory":0
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":32
      },
      "free":{
         "cpus":0,
         "memory":0
      },
      "site":"OpenStack-TUBITAK"
   },
   {
      "capacity":{
         "cpus":16,
         "memory":93
      },
      "free":{
         "cpus":0,
         "memory":62
      },
      "site":"OpenStack-UNIV-LILLE"
   }
]

fcasson commented 2 years ago

Now it is less clear to me what the per-node resources are. What about simply:

   {
      "cpus":8,
      "memory":32,
      "site":"OpenStack-STFC",
      "nodes":5,
      "nodes_free":2
   }
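One way to reconcile the two proposals is to derive the compact per-type form from the per-node list. The sketch below assumes a node counts towards `nodes_free` only when its free resources equal its capacity, so partially free nodes are not counted; the function name `summarise` is hypothetical.

```python
from collections import Counter

# Records copied from the per-node proposal earlier in this thread.
per_node = [
    {"capacity": {"cpus": 16, "memory": 32}, "free": {"cpus": 0, "memory": 0}, "site": "OpenStack-TUBITAK"},
    {"capacity": {"cpus": 60, "memory": 244}, "free": {"cpus": 0, "memory": 126}, "site": "OpenStack-STFC"},
    {"capacity": {"cpus": 8, "memory": 32}, "free": {"cpus": 8, "memory": 32}, "site": "OpenStack-STFC"},
    {"capacity": {"cpus": 16, "memory": 32}, "free": {"cpus": 2, "memory": 4}, "site": "OpenStack-TUBITAK"},
]

def summarise(per_node):
    """Collapse per-node records into one entry per (site, cpus, memory) type."""
    total = Counter()
    fully_free = Counter()
    for node in per_node:
        key = (node["site"], node["capacity"]["cpus"], node["capacity"]["memory"])
        total[key] += 1
        # Assumption: only completely idle nodes count as free.
        if node["free"] == node["capacity"]:
            fully_free[key] += 1
    return [
        {"cpus": cpus, "memory": memory, "site": site,
         "nodes": total[(site, cpus, memory)],
         "nodes_free": fully_free[(site, cpus, memory)]}
        for (site, cpus, memory) in sorted(total)
    ]

summary = summarise(per_node)
for entry in summary:
    print(entry)
```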

alahiff commented 2 years ago

What about nodes which are only partially free?

fcasson commented 2 years ago

Okay, I didn't anticipate that case (are most of the virtual nodes exclusive use or shared use?). There are two use cases we would use this information for:

  1. I want to submit single-node job(s) - I need to know how many cores (and how much memory) are available for that
  2. I want to submit multi-node job(s) using several identical nodes - I need to know how many nodes of a given type are available (and presumably only completely free nodes can be used) and what their properties are in cores per node and memory per node
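The two use cases above can be sketched as queries over per-node records in the capacity/free form proposed earlier. Both function names are hypothetical, and the second assumes (as suggested above) that only completely free nodes are usable for multi-node jobs.

```python
# Records copied from the per-node proposal earlier in this thread.
nodes = [
    {"capacity": {"cpus": 60, "memory": 244}, "free": {"cpus": 2, "memory": 130}, "site": "OpenStack-STFC"},
    {"capacity": {"cpus": 8, "memory": 32}, "free": {"cpus": 8, "memory": 32}, "site": "OpenStack-STFC"},
    {"capacity": {"cpus": 16, "memory": 93}, "free": {"cpus": 2, "memory": 66}, "site": "OpenStack-UNIV-LILLE"},
    {"capacity": {"cpus": 16, "memory": 32}, "free": {"cpus": 0, "memory": 0}, "site": "OpenStack-TUBITAK"},
]

def single_node_candidates(nodes, cpus, memory):
    """Use case 1: nodes whose free cpus and memory can hold one job."""
    return [n for n in nodes
            if n["free"]["cpus"] >= cpus and n["free"]["memory"] >= memory]

def multi_node_candidates(nodes, cpus_per_node):
    """Use case 2: completely free nodes with at least cpus_per_node cores."""
    return [n for n in nodes
            if n["free"] == n["capacity"] and n["capacity"]["cpus"] >= cpus_per_node]

print([n["site"] for n in single_node_candidates(nodes, cpus=2, memory=60)])
print(len(multi_node_candidates(nodes, cpus_per_node=8)))
```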

fcasson commented 2 years ago

Is there something working here or on #156 that we could start to use?

If #156 is not possible yet, I at least need to decide on a default value for procs-per-node in our multi-node requests (use case 2 above). Maybe I should just plump for 16?

alahiff commented 2 years ago

#156 is now working. Behind the scenes I need to make some improvements on the server side, but from the user perspective things won't change (unless changes are requested).

In reality at the moment 16 is a reasonable default.

alahiff commented 2 years ago

The first version of the prominence resources command gives output like this:

$ prominence resources
Existing resources

Total           Free
Cpus  Memory    Cpus  Memory    Site
60    244       0     124       OpenStack-STFC
8     32        8     32        OpenStack-STFC
60    244       0     124       OpenStack-STFC
32    94        0     30        OpenStack-TUBITAK
32    94        32    94        OpenStack-TUBITAK
60    244       12    148       OpenStack-STFC
16    47        0     15        OpenStack-TUBITAK
16    47        0     15        OpenStack-TUBITAK
16    32        16    32        OpenStack-TUBITAK
60    244       12    148       OpenStack-STFC
32    94        16    62        OpenStack-TUBITAK
64    472       64    472       OpenStack-MetaCentrum

Potential resources

--coming soon--

Each line corresponds to an existing worker node which is allowed to run jobs from the user making the request.
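For scripting against this output, a minimal parsing sketch follows. It assumes the column layout shown above (four integer columns followed by a site name); the real client output may differ between versions, and `parse_resources` is a hypothetical helper, not part of the PROMINENCE client.

```python
# A shortened copy of the output shown above.
SAMPLE = """\
Existing resources

Total           Free
Cpus  Memory    Cpus  Memory    Site
60    244       0     124       OpenStack-STFC
8     32        8     32        OpenStack-STFC
32    94        32    94        OpenStack-TUBITAK
64    472       64    472       OpenStack-MetaCentrum

Potential resources

--coming soon--
"""

def parse_resources(text):
    """Turn each data row into a dict; headers and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # A data row has four integers followed by a site name.
        if len(parts) == 5 and all(p.isdigit() for p in parts[:4]):
            total_cpus, total_mem, free_cpus, free_mem = map(int, parts[:4])
            rows.append({
                "total_cpus": total_cpus, "total_memory": total_mem,
                "free_cpus": free_cpus, "free_memory": free_mem,
                "site": parts[4],
            })
    return rows

rows = parse_resources(SAMPLE)
print(len(rows), rows[0])
```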

I will make this available on Monday. The next step will be to also list potential resources, i.e. resources which don't exist yet but could be created. This won't be 100% accurate as no private or public clouds can tell you if it's definitely possible to create a VM of a particular size, unless you actually do it.

fcasson commented 2 years ago

Would it be more helpful to sort the resources by free CPUs? If the list gets long, identical entries could be grouped somehow.

EDIT: prominence resources | uniq -c works well (with client v0.17.0)
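Note that `uniq -c` only collapses adjacent duplicate lines, so it depends on identical entries appearing next to each other in the output. A small sketch that counts identical rows regardless of their order, using rows copied from the example output above (`group_identical` is a hypothetical helper):

```python
from collections import Counter

lines = [
    "60    244       0     124       OpenStack-STFC",
    "8     32        8     32        OpenStack-STFC",
    "60    244       0     124       OpenStack-STFC",
    "16    47        0     15        OpenStack-TUBITAK",
    "16    47        0     15        OpenStack-TUBITAK",
]

def group_identical(lines):
    """Count identical rows, normalising runs of whitespace first."""
    counts = Counter(" ".join(line.split()) for line in lines if line.strip())
    return counts.most_common()  # most frequent rows first

for row, count in group_identical(lines):
    print(f"{count:3d}  {row}")
```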

alahiff commented 2 years ago

Yes, I had already thought about sorting by free CPUs (descending) to make it clearer, and probably not displaying items with no free CPUs by default.
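A sketch of that display rule, assuming each worker node is held as a dict with a `free_cpus` field; the field names, the `show_full` switch and the function name are all hypothetical.

```python
# Values copied from the example output above.
rows = [
    {"free_cpus": 0, "free_memory": 124, "site": "OpenStack-STFC"},
    {"free_cpus": 8, "free_memory": 32, "site": "OpenStack-STFC"},
    {"free_cpus": 64, "free_memory": 472, "site": "OpenStack-MetaCentrum"},
    {"free_cpus": 12, "free_memory": 148, "site": "OpenStack-STFC"},
]

def display_order(rows, show_full=False):
    """Sort by free cpus (descending); hide fully used nodes unless asked."""
    visible = rows if show_full else [r for r in rows if r["free_cpus"] > 0]
    return sorted(visible, key=lambda r: r["free_cpus"], reverse=True)

for r in display_order(rows):
    print(r["free_cpus"], r["free_memory"], r["site"])
```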