wdennis opened this issue 2 months ago
Can you send me the output from the Slurm API for this job? That will let me see how the data is being exported and why it might not be showing correctly on the dashboard. Based on what I am seeing here, do you have 4x 1080Ti's, each split four ways using vGPU or something similar?
The GPUs are not split... There are 4 1080Ti's in each GPU box, and I have provisioned GRES as so: Gres=gpu:4,shard:16
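For context, a sketch of the relevant config (device paths and the rest of the node definition are illustrative, not the exact production values):

slurm.conf:
GresTypes=gpu,shard
NodeName=test-slurm-n01 Gres=gpu:4,shard:16 ...

gres.conf:
Name=gpu File=/dev/nvidia[0-3]
Name=shard Count=16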
Here is the output from /job/6 -- test-cluster_jobid_6_from_rest.txt
Thanks for sending this over. Can you also send the output from /api/slurm/nodes while the job is running?
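If it's easier to capture, something like this against the dashboard host should dump it to a file (assuming the default Next.js port; adjust host/port as needed):

curl -s http://localhost:3000/api/slurm/nodes > nodes_while_running.json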
Here you go.
I think I have something that should work. If you have a chance, could you send me over an example from /api/slurm/nodes with a GPU slice (or several) actually being used? It looks like in the example you sent over there are slices available but none in use, correct? I had never actually used slices, so this is great, I appreciate it!
Here is a nodes example, using 4 shards on one node (test-slurm-n01) - they all landed on GPU instance 0. I may modify Slurm to distribute the shards over the GPUs (since "shard" is just an "access token" for sharing an entire GPU, it is of no real use to land more than one shard per job on a single GPU - they are not "slices" or "shares" of the GPU, per SchedMD.)
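For reference, the test job just asked for shard GRES, along these lines (the actual workload is not important, this is only illustrative):

sbatch -N1 --gres=shard:4 --wrap "sleep 600"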
See if this is showing up correctly for you now. I also made some general changes to the cards (since shards and the GPU icons really wouldn't mix) and cleaned up the hover card. Let me know how it looks.
After pulling the changes, I know I’d need a stop/start in pm2, but do I need to “next build” in the middle?
You shouldn't have to for these smaller updates; you would if any of the updates include npm installs. I always do a "git pull", "npm i", and then a "pm2 restart 0" (or whatever id maps to the correct dashboard) when I pull down changes on our production dashboards.
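In other words, roughly the following from the dashboard checkout (the path and pm2 id are placeholders; add a "next build" only when a full rebuild is actually needed):

cd /opt/slurm-node-dashboard   # or wherever your checkout lives
git pull
npm i                          # only strictly needed when dependencies changed
pm2 restart 0                  # use the id or name of your dashboard process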
Running into a build problem here, please advise... snd_next_build_error.txt
Did you add the new module @radix-ui/react-tooltip?
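If not, something like this should get the build going again (assuming the usual npm workflow):

npm i @radix-ui/react-tooltip
npm run build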
Give that a go. I had been cleaning up some code and thought I had fully removed it.
Thanks!
OK, works now, thanks!
However, I have some issue with the card info about GPU utilization, and also with the top utilization metric "dials": a GPU with even one shard in use should count the same as a GPU that has been allocated whole (--gres=gpu:1, for example), so I believe the GPU util should be reflected the same as whole if even one shard lands on that GPU. Remember that "shard" is an access token for the entire GPU resource; it does not fractionalize the GPU like MIG does. So I would also substitute the word "TOKENS" for "SLICES" in the card verbiage.
To make things even more complicated, one can have both whole GPU GRES and shard GPU GRES on the same node at the same time (resources permitting) - let me know if you want to see an example of that.
This is great, yeah, any examples you can send would be greatly appreciated! We only use MIG and standard GPU allocations, so getting some good examples with shards and how they might be used on a system is really helpful!
Here is an API example of a node with both whole GPUs allocated, as well as shards
(Can search for the string "<<<<<" to find the line)
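For completeness, the request side of that mix is just two ordinary submissions, along these lines (scripts, counts, and node targeting here are purely illustrative):

sbatch --gres=gpu:2 -w test-slurm-n01 whole_gpu_job.sh
sbatch --gres=shard:2 -w test-slurm-n01 shard_job.sh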
OK, so I spent some time on this. I think the issue is with how shards are set up for the nodes; I believe it should be including the GPU type, whereas right now it's null:
gpu:(null):2(IDX:2-3),shard:(null):2(2/4,0/4,0/4,0/4)
should be something like
gpu:1080ti:4(S:0-1),shard:1080ti:16(S:0-1)
I added shards to my test node and was able to get this working with the gpu name there.
I found these examples online:
Gres=gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)
In your slurm.conf or gres.conf, you should be able to give these a type descriptor, which should then fix how it's being displayed.
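For example, something along these lines (device paths and the rest of the node line are placeholders, and the exact shard syntax is worth double-checking against the Slurm gres docs):

gres.conf:
Name=gpu Type=1080ti File=/dev/nvidia[0-3]
Name=shard Count=16

slurm.conf:
NodeName=test-slurm-n01 Gres=gpu:1080ti:4,shard:16 ...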
Let me know if you get a chance to look at this, and whether everything is otherwise still working as expected!
Thanks!
I have allocated a GPU on a node as so:
But if I mouse over the node card and get the popup, I am seeing GPUS (USED) as "NAN" when it should be "1". (Also, shouldn't the "NAN" be "0" for both GPUs and Shards?)
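If it helps, the raw Gres / GresUsed strings the dashboard has to parse can be checked straight from Slurm (node name is just an example):

scontrol -d show node test-slurm-n01 | grep -i gres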