thediymaker / slurm-node-dashboard

Slurm HPC node status page
GNU General Public License v3.0

Node showing incorrect GPU usage ("NAN", s/b 1) #61

Open wdennis opened 2 months ago

wdennis commented 2 months ago

I have allocated a GPU on a node like so:

wdennis@test-slurm-js:~$ srun --pty -n 1 -t 24:00:00 --partition=debug --gres=gpu:1 --mem=12GB bash -i
wdennis@test-slurm-n01:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 4     debug     bash  wdennis  R       0:04      1 test-slurm-n01
wdennis@test-slurm-n01:~$
wdennis@test-slurm-n01:~$ nvidia-smi
Wed Aug 21 20:33:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:02:00.0 Off |                  N/A |
| 23%   23C    P8              8W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
wdennis@test-slurm-n01:~$ ls
Backups  bin  Desktop  matrixMul
wdennis@test-slurm-n01:~$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "NVIDIA GeForce GTX 1080 Ti" with compute capability 6.1

Allocating 30
Allocating 30720
Allocating 30720
Allocating 30720
Err: 0
Err: 0
MatrixA(10000,20000), MatrixB(20000,10000)

But if I mouse over the node card to get the popup, I see GPUS (USED) as "NAN" when it should be "1". (Also, when nothing is allocated, shouldn't the value be "0" rather than "NAN" for both GPUs and Shards?) Screenshot: db-screen-node-gpu-wrong

thediymaker commented 2 months ago

Can you send me the output from the Slurm API for this job? This will let me see how the data is being exported and why it might not be showing correctly on the dashboard. Based on what I am seeing here, do you have 4x 1080 Tis, each split 4 ways using vGPU or something similar?

wdennis commented 2 months ago

The GPUs are not split... There are 4x 1080 Tis in each GPU box, and I have provisioned GRES like so: Gres=gpu:4,shard:16

Here is the output from /job/6 -- test-cluster_jobid_6_from_rest.txt

thediymaker commented 2 months ago

Thanks for sending this over. Can you also send the output from /api/slurm/nodes while the job is running?
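
Something like this should grab it, assuming the dashboard is running locally on the default Next.js port 3000 (adjust the host/port and output filename to match your install):

curl -s http://localhost:3000/api/slurm/nodes -o nodes.json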

wdennis commented 2 months ago

Here you go.

test-cluster_nodes_from_rest.txt

thediymaker commented 2 months ago

I think I have something that should work. If you have a chance, could you send me an example from /api/slurm/nodes with one or more GPU slices in use? It looks like in the example you sent over there are slices available but none in use, correct? I had never actually used slices, so this is great, I appreciate it!

wdennis commented 2 months ago

Here is a nodes example, using 4 shards on one node (test-slurm-n01) - they all landed on GPU instance 0. I may modify Slurm to distribute the shards over the GPUs (since a "shard" is just an "access token" for sharing an entire GPU, there is no real use in landing more than one shard per job on a single GPU - they are not "slices" or "shares" of the GPU, per SchedMD).

test-cluster_nodes_from_rest-using_shards.txt

thediymaker commented 2 months ago

See if this is showing up correctly for you now. I also made some general changes to the cards (since shards + the GPU icons really wouldn't mix) and cleaned up the hover card. Let me know how it looks.

wdennis commented 2 months ago

After pulling the changes, I know I’d need a stop/start in pm2, but do I need to “next build” in the middle?


thediymaker commented 2 months ago

You shouldn't have to for these smaller updates; if an update pulls in new npm packages, you would. I always do a "git pull", "npm i", and then a "pm2 restart 0" (or whatever id the correct dashboard is running under) when I pull down changes on our production dashboards.
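
For reference, that sequence on one of ours looks like this ("0" here is just the pm2 id that particular dashboard happens to be registered under):

git pull
npm i
pm2 restart 0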

wdennis commented 2 months ago

Running into a build problem here, please advise... snd_next_build_error.txt

Did you add a new module, @radix-ui/react-tooltip?
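
If it is a new dependency, I am assuming something like

npm install @radix-ui/react-tooltip

(or just re-running "npm i" after the pull) on my end would take care of it, but wanted to check first.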

thediymaker commented 2 months ago

Give that a go. I had been cleaning up some code and thought I had fully removed it.

Thanks!

wdennis commented 2 months ago

OK, works now, thanks!

However, I still have some issues with the card info about GPU utilization, and also with the top utilization metric "dials".

To make things even more complicated, one can have both whole GPU GRES and shard GPU GRES on the same node at the same time (resources permitting) - let me know if you want to see an example of that.

thediymaker commented 2 months ago

This is great! Yeah, any examples you can send would be greatly appreciated. We only use MIG and standard GPU allocations, so getting some good examples of shards and how they might be used on a system is really helpful!

wdennis commented 2 months ago

Here is an API example of a node with both whole GPUs allocated as well as shards.

(You can search for the string "<<<<<" to find the line.)

test-cluster_nodes_from_rest-using_shard+whole.txt

thediymaker commented 2 months ago

OK, so I spent some time on this. I think the issue is with how the shards are set up for the nodes; I believe the GRES string should include the GPU type, whereas right now it's null:

gpu:(null):2(IDX:2-3),shard:(null):2(2/4,0/4,0/4,0/4)

It should be something like:

gpu:1080ti:4(S:0-1),shard:1080ti:16(S:0-1)

I added shards to my test node and was able to get this working with the GPU name there.

I found this example online:

Gres=gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)

In your slurm.conf or gres.conf file, you should be able to give these a type descriptor, which should then fix how it's being displayed.
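
For your 4x 1080 Ti / 16-shard nodes, I would expect something roughly like this to work (untested on my side, and the device paths are just an assumption, so double-check against the slurm.conf and gres.conf man pages before rolling it out):

# slurm.conf - node definition (keep your existing CPU/memory settings)
NodeName=test-slurm-n01 Gres=gpu:1080ti:4,shard:1080ti:16

# gres.conf on the node
Name=gpu Type=1080ti File=/dev/nvidia[0-3]
Name=shard Count=16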

Let me know if you get a chance to look at this, and whether everything else is still working as expected!

Thanks!