ucsf-wynton / wynton-website-hpc

The Official Wynton HPC User Website
https://wynton.ucsf.edu/hpc/

GPU shares: add table of GPU shares #24

Open · HenrikBengtsson opened this issue 3 years ago

HenrikBengtsson commented 3 years ago

The information on GPU shares can be found via

$ SGE_SINGLE_LINE=true qconf -srqs shared_gpu_limits
{
   name         shared_gpu_limits
   description  "Limits on use of owned GPUs by non-owners"
   enabled      TRUE
   limit        projects {!rosenberglab} queues gpu.q hosts {@rosenberglabgpunodes} to h_rt=7200
   limit        projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues gpu.q hosts {@msggpunodes} to h_rt=7200
   limit        projects {!grabelab} queues gpu.q hosts {@grabelabgpunodes} to h_rt=7200
   limit        projects {!genetics,!arnaoutlab} queues gpu.q hosts qb3-iogpu1 to h_rt=7200
   limit        projects {!huanglab} queues gpu.q hosts {@huanglabgpunodes} to h_rt=7200
   limit        projects {!jacobsonlab,!arnaoutlab} queues gpu.q hosts qb3-idgpu6 to h_rt=7200
   limit        projects {!gladstone,!pollardlab} queues gpu.q hosts {@gladstonegpunodes} to h_rt=7200
   limit        projects {!i4h} queues gpu.q hosts {@i4hgpunodes} to h_rt=7200
}

For example,

   limit        projects {!genetics,!arnaoutlab} queues gpu.q hosts qb3-iogpu1 to h_rt=7200

tells SGE that "Except for members part of lab genetics or arnaoutlab should have their jobs limited to 2 hours (=7200 seconds) on host qb3-iogpu1.

Another example is:

   limit        projects {!rosenberglab} queues gpu.q hosts {@rosenberglabgpunodes} to h_rt=7200

tells SGE that "Except for members part of lab rosenberglab should have their jobs limited to 2 hours (=7200 seconds) on hosts part of the @rosenberglabgpunodes hostgroup. To see which these compute nodes are, call:

$ SGE_SINGLE_LINE=true qconf -shgrp @rosenberglabgpunodes
group_name @rosenberglabgpunodes
hostlist msg-ihgpu5

To see which hostgroups exist, call:

$ SGE_SINGLE_LINE=true qconf -shgrpl
@allhosts
@bigmem
@gladstonegpunodes
@gpunodes
@grabelabgpunodes
@huanglabgpunodes
@i4hgpunodes
@instnodes
@jacobsonlabgpunodes
@membernodes
@msggpunodes
@n106
@rosenberglabgpunodes
@testnodes
@v100nodes
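
To dump the members of all GPU-related hostgroups in one go, a minimal sketch (assuming every relevant hostgroup name contains "gpu"):

$ for hg in $(qconf -shgrpl | grep gpu); do
>   SGE_SINGLE_LINE=true qconf -shgrp "$hg"
> done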

Appendix

From man qconf:

       -srqs [rqs_name_list]         <show RQS configuration>
              Show the definition of the resource quota sets (RQS) specified by the argument.

From man sge_resource_quota:


...
      The tags for expressing a resource quota rule are:
...
       projects
              Contains a comma-separated list of projects (see project(5)).   This
              parameter filters jobs requesting a project in the list. Any project
              not in the list will not be considered for the resource quota  rule.
              If  no  project  filter is specified, all projects, and jobs with no
              requested project, match the rule. The value '*' means all jobs with
              requested projects. To exclude a project from the rule, the name can
              be prefixed with '!'.  The  value  '!*'  means  only  jobs  with  no
              project requested.
HenrikBengtsson commented 3 years ago

Added:

  1. wynton gpushares to wynton-tools
  2. `make assets/data/gpu_shares.tsv` target to https://github.com/UCSF-HPC/wynton/blob/master/docs/Makefile (see sketch below)
  3. Prototype page https://wynton.ucsf.edu/hpc/about/gpu-shares.html
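
For reference, a minimal sketch of how the raw TSV could be scraped from the RQS configuration (this is not the actual wynton gpushares implementation; the column choice and parsing are illustrative only):

$ SGE_SINGLE_LINE=true qconf -srqs shared_gpu_limits \
>   | awk '$1 == "limit" { gsub(/[{}!]/, ""); print $3 "\t" $5 "\t" $7 }'
rosenberglab	gpu.q	@rosenberglabgpunodes
...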
HenrikBengtsson commented 3 years ago

@jlbl,

  1. Does the table on https://wynton.ucsf.edu/hpc/about/gpu-shares.html look correct?

  2. The info "Wynton HPC has 38 GPU nodes with a total of 132 GPUs available to all users. Among these, 31 GPU nodes, with a total of 108 GPUs, ..." in the top banner is manually updated. What's the latest stats?

HenrikBengtsson commented 1 year ago

@jlbl, @ellestad, @murashka-ucsf, does the table on https://wynton.ucsf.edu/hpc/about/gpu-shares.html look correct to you?

If I get one okay, I'll go ahead and link to this page.

jlbl commented 1 year ago

The column labeled "GPU slots" really translates to "GPU nodes". Some of those nodes have 2 GPUs, some have 4, and some have 8. Adding insult to injury, there are a few cases where multiple labs share a node, each with access to a subset of the GPUs in that node. So, unfortunately, the picture isn't quite complete.

HenrikBengtsson commented 1 year ago

Updated the column names to "#GPU Nodes" and "GPU Nodes".

I just realized that the last two entries appear to be duplicated. This is because we now have two GPU queues. I'll update the tools that generate the raw data to add a Queue column.

HenrikBengtsson commented 1 year ago

Added 'Queue' column to https://wynton.ucsf.edu/hpc/about/gpu-shares.html. The number of unique GPU nodes calculated from the GPU shares matches the number of contributed GPU nodes reported on https://wynton.ucsf.edu/hpc/about/gpus.html, which suggests that both data sources are correct.

HenrikBengtsson commented 1 year ago

After my most recent updates, are we ready to ship https://wynton.ucsf.edu/hpc/about/gpu-shares.html?

HenrikBengtsson commented 1 year ago

Friendly reminder: Is it okay if I make this content/page public?

jlbl commented 1 year ago

The page currently looks broken, or am I missing something?

HenrikBengtsson commented 1 year ago

Hmm, looks like the https://github.com/ucsf-wynton/wynton-website-hpc/blob/master/docs/hpc/assets/data/gpu_shares.tsv file is empty. I'll investigate the cronjob...

HenrikBengtsson commented 1 year ago

Fixed (commit bf02d9f21); the tool that queried qconf had been moved. The script has been updated to give an error if this happens again, instead of writing empty output.
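
The guard is essentially this pattern (a hypothetical sketch, not the actual script):

#!/usr/bin/env bash
set -euo pipefail
out=$(SGE_SINGLE_LINE=true qconf -srqs shared_gpu_limits)
if [[ -z "${out}" ]]; then
  echo "ERROR: 'qconf -srqs shared_gpu_limits' produced no output" >&2
  exit 1
fi
printf '%s\n' "${out}"   # only write the data once we know it is non-empty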

Please check https://wynton.ucsf.edu/hpc/about/gpu-shares.html again

jlbl commented 1 year ago

4gpu.q is confusing things. 7 of the 8 nodes listed under 4gpu.q for MSG are also listed in their gpu.q share (inflating the total node count).

HenrikBengtsson commented 1 year ago

We could just drop the "46 GPU nodes" count

jlbl commented 1 year ago

But those nodes aren't available in gpu.q, so they shouldn't be listed there.

HenrikBengtsson commented 1 year ago

Sorry, ELI5

jlbl commented 1 year ago

Sorry - poor phrasing. gpu.q is not available on those hosts, only 4gpu.q. E.g.:

$ qhost -q -h msg-iogpu11
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
msg-iogpu11             lx-amd64       32    2   16   32 12.22  251.6G   11.6G    4.0G  125.3M
   4gpu.q               BP    0/1/1         
   test.q               BIP   0/0/2         
   long.q               BP    0/0/16        
   short.q              BP    0/0/16        

So to have those nodes listed in the line for MSG gpu.q hosts is incorrect.
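
For cross-checking, the hosts that a queue is actually configured on can be listed with something like:

$ qconf -sq gpu.q | grep '^hostlist'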

HenrikBengtsson commented 1 year ago

So to have those nodes listed in the line for MSG gpu.q hosts is incorrect.

Got it. So the raw data (https://github.com/ucsf-wynton/wynton-website-hpc/blob/master/docs/hpc/assets/data/gpu_shares.tsv) is incorrect; I had thought it was only the presentation that needed improving. I'll go back to the drawing board for generating those data.

HenrikBengtsson commented 1 year ago

Could this be a misconfiguration of the SGE @msggpunodes hostgroup? Because ...

$ SGE_SINGLE_LINE=true qconf -srqs shared_gpu_limits
{
   name         shared_gpu_limits
   description  "Limits on use of owned GPUs by non-owners"
   enabled      TRUE
   limit        projects {!rosenberglab} queues gpu.q hosts {@rosenberglabgpunodes} to h_rt=7200
   limit        projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues gpu.q hosts {@msggpunodes} to h_rt=7200
   limit        projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues 4gpu.q hosts {@msg4gpunodes} to h_rt=7200
   limit        projects {!grabelab} queues gpu.q hosts {@grabelabgpunodes} to h_rt=7200
   limit        projects {!genetics,!arnaoutlab} queues gpu.q hosts qb3-iogpu1 to h_rt=7200
   limit        projects {!huanglab} queues gpu.q hosts {@huanglabgpunodes} to h_rt=7200
   limit        projects {!jacobsonlab,!arnaoutlab} queues gpu.q hosts qb3-idgpu6 to h_rt=7200
   limit        projects {!gladstone,!pollardlab} queues gpu.q hosts {@gladstonegpunodes} to h_rt=7200
   limit        projects {!i4h} queues gpu.q hosts {@i4hgpunodes} to h_rt=7200
   limit        projects {!keiserlab} queues gpu.q hosts qb3-atgpu16 to h_rt=7200
   limit        projects {!ichs} queues gpu.q hosts qb3-atgpu19 to h_rt=7200
   limit        projects {!theodorislab} queues gpu.q hosts qb3-atgpu20 to h_rt=7200
   limit        projects {!neuroppg} queues gpu.q hosts qb3-atgpu21 to h_rt=7200
   limit        projects {!gidb} queues gpu.q hosts qb3-atgpu22 to h_rt=7200
   limit        projects {!rsl} queues gpu.q hosts qb3-atgpu23 to h_rt=7200
}

Focusing on:

   limit        projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues gpu.q hosts {@msggpunodes} to h_rt=7200
   limit        projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues 4gpu.q hosts {@msg4gpunodes} to h_rt=7200

I get:

$ qconf -shgrp "@msggpunodes"
group_name @msggpunodes
hostlist msg-iogpu1 msg-iogpu2 msg-iogpu3 msg-iogpu4 msg-iogpu5 msg-iogpu6 \
         msg-iogpu7 msg-iogpu8 msg-iogpu9 msg-iogpu11 msg-iogpu12 msg-iogpu13 \
         msg-ihgpu1 msg-ihgpu2 msg-ihgpu3 msg-ihgpu4 msg-ihgpu5 qb3-idgpu3 \
         qb3-idgpu4 qb3-idgpu5 qb3-idgpu11 qb3-idgpu12
$ qconf -shgrp "@msg4gpunodes"
group_name @msg4gpunodes
hostlist qb3-iogpu5 msg-iogpu11 msg-iogpu12 msg-iogpu13 qb3-idgpu3 qb3-idgpu4 \
         qb3-idgpu5 qb3-atgpu18

Note how msg-iogpu11, msg-iogpu12, and msg-iogpu13 (and a few others) appear in both hostgroups.
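
The overlap is easy to compute; a minimal sketch, assuming this SGE version supports qconf -shgrp_resolved (which expands a hostgroup to plain hostnames):

$ comm -12 \
>     <(qconf -shgrp_resolved @msggpunodes  | tr ' ' '\n' | sort) \
>     <(qconf -shgrp_resolved @msg4gpunodes | tr ' ' '\n' | sort)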

HenrikBengtsson commented 1 year ago

A friendly bump on this; I think SGE is misconfigured, preventing us from pulling this information automatically.

Nicki-Martin commented 1 year ago

@jlbl is this a quick fix that could be addressed in the near future? We're not posting a link to this contributed GPUs page until the table is accurate.