HenrikBengtsson opened this issue 3 years ago
Added `wynton gpushares` to wynton-tools.

@jlbl, does the table on https://wynton.ucsf.edu/hpc/about/gpu-shares.html look correct?
The info "Wynton HPC has 38 GPU nodes with a total of 132 GPUs available to all users. Among these, 31 GPU nodes, with a total of 108 GPUs, ..." in the top banner is manually updated. What are the latest stats?
@jlbl, @ellestad, @murashka-ucsf, does the table on https://wynton.ucsf.edu/hpc/about/gpu-shares.html look correct to you?
If I get an okay, I'll go ahead and link to this page.
The column labeled "GPU slots" really translates to "GPU nodes". Some of those nodes have 2 GPUs, some have 4, and some have 8. Adding insult to injury, there are a few cases where multiple labs share a node, each with access to a subset of the GPUs in that node. So, unfortunately, the picture isn't quite complete.
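For reference, the per-node GPU count can also be checked directly on a node; a minimal sketch, assuming NVIDIA hardware with `nvidia-smi` on the PATH (`msg-iogpu11` is just an example host):

$ ssh msg-iogpu11 nvidia-smi -L   # prints one line per GPU installed in that node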
Updated column names to `#GPU Nodes` and `GPU Nodes`.
I just realized that the last two entries appear to be duplicated. This is because we now have two GPU queues. I'll update the tools that generate the raw data to add a `Queue` column.
Added 'Queue' column to https://wynton.ucsf.edu/hpc/about/gpu-shares.html. The number of unique GPU nodes calculated from the GPU shares matches the number of contributed GPU nodes reported on https://wynton.ucsf.edu/hpc/about/gpus.html. That gives evidence that both sources of data are correct.
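That cross-check can be scripted; a rough sketch, assuming host names live in a tab-separated column of gpu_shares.tsv (the column number 3 here is a hypothetical layout):

$ cut -f3 docs/hpc/assets/data/gpu_shares.tsv | tail -n +2 | tr ' ' '\n' | sort -u | wc -l   # skip the header, split space-separated host lists, count unique hosts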
After my most recent updates, are we ready to ship https://wynton.ucsf.edu/hpc/about/gpu-shares.html?
Friendly reminder: is it okay if I make this content/page public?
The page currently looks broken, or am I missing something?
Hmm, looks like the https://github.com/ucsf-wynton/wynton-website-hpc/blob/master/docs/hpc/assets/data/gpu_shares.tsv file is empty. I'll investigate the cronjob...
Fixed (commit bf02d9f21); the tool that queried `qconf` had been moved. The script was updated to give an error if this happens again, instead of writing empty output.
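Not the actual patch, but a minimal sketch of that kind of guard, assuming a Bash script that generates the TSV:

#!/usr/bin/env bash
# Fail fast if 'qconf' is missing, rather than silently writing an empty TSV
if ! command -v qconf > /dev/null 2>&1; then
  >&2 echo "ERROR: 'qconf' not found on PATH"
  exit 1
fi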
Please check https://wynton.ucsf.edu/hpc/about/gpu-shares.html again.
`4gpu.q` is confusing things. 7 of the 8 nodes listed under `4gpu.q` for MSG are also listed in their `gpu.q` share (inflating the total node count).
We could just drop the "46 GPU nodes" count.
But those nodes aren't available in `gpu.q`, so they shouldn't be listed there.
Sorry, ELI5
Sorry - poor phrasing. `gpu.q` is not available on those hosts, only `4gpu.q`. E.g.:
$ qhost -q -h msg-iogpu11
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
msg-iogpu11 lx-amd64 32 2 16 32 12.22 251.6G 11.6G 4.0G 125.3M
4gpu.q BP 0/1/1
test.q BIP 0/0/2
long.q BP 0/0/16
short.q BP 0/0/16
So to have those nodes listed in the line for MSG `gpu.q` hosts is incorrect.
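For reference, which hosts (or host groups) a queue's hostlist contains can also be checked from the queue configuration; a small sketch:

$ qconf -sq gpu.q | grep '^hostlist'   # show the hostlist attribute of gpu.q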
Got it. So the raw data (https://github.com/ucsf-wynton/wynton-website-hpc/blob/master/docs/hpc/assets/data/gpu_shares.tsv) is incorrect; I had thought it was only the presentation that needed improving. I'll go back to the drawing board for generating those data.
Could this be a misconfiguration of the SGE `@msggpunodes` host group? Because ...
$ SGE_SINGLE_LINE=true qconf -srqs shared_gpu_limits
{
name shared_gpu_limits
description "Limits on use of owned GPUs by non-owners"
enabled TRUE
limit projects {!rosenberglab} queues gpu.q hosts {@rosenberglabgpunodes} to h_rt=7200
limit projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues gpu.q hosts {@msggpunodes} to h_rt=7200
limit projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues 4gpu.q hosts {@msg4gpunodes} to h_rt=7200
limit projects {!grabelab} queues gpu.q hosts {@grabelabgpunodes} to h_rt=7200
limit projects {!genetics,!arnaoutlab} queues gpu.q hosts qb3-iogpu1 to h_rt=7200
limit projects {!huanglab} queues gpu.q hosts {@huanglabgpunodes} to h_rt=7200
limit projects {!jacobsonlab,!arnaoutlab} queues gpu.q hosts qb3-idgpu6 to h_rt=7200
limit projects {!gladstone,!pollardlab} queues gpu.q hosts {@gladstonegpunodes} to h_rt=7200
limit projects {!i4h} queues gpu.q hosts {@i4hgpunodes} to h_rt=7200
limit projects {!keiserlab} queues gpu.q hosts qb3-atgpu16 to h_rt=7200
limit projects {!ichs} queues gpu.q hosts qb3-atgpu19 to h_rt=7200
limit projects {!theodorislab} queues gpu.q hosts qb3-atgpu20 to h_rt=7200
limit projects {!neuroppg} queues gpu.q hosts qb3-atgpu21 to h_rt=7200
limit projects {!gidb} queues gpu.q hosts qb3-atgpu22 to h_rt=7200
limit projects {!rsl} queues gpu.q hosts qb3-atgpu23 to h_rt=7200
}
Focusing on:
limit projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues gpu.q hosts {@msggpunodes} to h_rt=7200
limit projects {!agardlab,!chenglab,!frostlab,!stroudlab,!gpu-collab} queues 4gpu.q hosts {@msg4gpunodes} to h_rt=7200
I get:
$ qconf -shgrp "@msggpunodes"
group_name @msggpunodes
hostlist msg-iogpu1 msg-iogpu2 msg-iogpu3 msg-iogpu4 msg-iogpu5 msg-iogpu6 \
msg-iogpu7 msg-iogpu8 msg-iogpu9 msg-iogpu11 msg-iogpu12 msg-iogpu13 \
msg-ihgpu1 msg-ihgpu2 msg-ihgpu3 msg-ihgpu4 msg-ihgpu5 qb3-idgpu3 \
qb3-idgpu4 qb3-idgpu5 qb3-idgpu11 qb3-idgpu12
$ qconf -shgrp "@msg4gpunodes"
group_name @msg4gpunodes
hostlist qb3-iogpu5 msg-iogpu11 msg-iogpu12 msg-iogpu13 qb3-idgpu3 qb3-idgpu4 \
qb3-idgpu5 qb3-atgpu18
Note how `msg-iogpu11`, `msg-iogpu12`, and `msg-iogpu13` (and a few others) are in both sets.
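The overlap can be computed directly; a sketch using `qconf -shgrp_resolved` (which flattens a host group to plain host names) together with `comm`:

$ comm -12 <(qconf -shgrp_resolved @msggpunodes  | tr ' ' '\n' | sort) \
           <(qconf -shgrp_resolved @msg4gpunodes | tr ' ' '\n' | sort)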
A friendly bump on this; I think SGE is misconfigured, preventing us from pulling this information automatically.
@jlbl is this a quick fix that could be addressed in the near future? We're not posting a link to this contributed GPUs page until the table is accurate.
The information on GPU shares can be found via:

$ qconf -srqs shared_gpu_limits

For example,

limit projects {!genetics,!arnaoutlab} queues gpu.q hosts qb3-iogpu1 to h_rt=7200

tells SGE that "everyone except members of lab `genetics` or `arnaoutlab` should have their jobs limited to 2 hours (=7200 seconds) on host `qb3-iogpu1`".

Another example is:

limit projects {!rosenberglab} queues gpu.q hosts {@rosenberglabgpunodes} to h_rt=7200

which tells SGE that "everyone except members of lab `rosenberglab` should have their jobs limited to 2 hours (=7200 seconds) on hosts in the `@rosenberglabgpunodes` host group". To see which compute nodes these are, call:

$ qconf -shgrp "@rosenberglabgpunodes"

To see what host groups exist:

$ qconf -shgrpl
Appendix

From `man qconf`:

From `man sge_resource_quota`: