rackslab / Slurm-web

Open source web dashboard for Slurm HPC clusters
https://slurm-web.com
GNU General Public License v3.0
317 stars 89 forks

dashboard/js/draw/2d-draw.js not drawing correctly when multiple jobs running on a single node. #210

Closed bviviano closed 3 months ago

bviviano commented 4 years ago

Something changed in 2d-draw.js between 2.2.2 and 2.2.5 (I am trying to isolate it now) related to multiple jobs running on a single node.

In 2.2.2, when multiple jobs were running on a single node, each square representing a used core got a different color. In 2.2.5, only the last core/job gets a color; the other used cores show as if they are not allocated.

I am attaching pictures from my interface.

Slurm Web 2.2.2: [image: Slurm-Web-2 2 2_Node]

Slurm Web 2.2.5: [image: Slurm-Web-2 2 5_Node]

These are from the same running HTTPd instance, same node captured via screen shot, just different install directories for 2.2.2 vs. 2.2.5.

I've isolated the issue to the drawCores function in 2d-draw.js. I am working through the code to try to understand why it is no longer drawing as expected, but thought I'd open this ticket in case there is something else I am missing.

BSCrumpton commented 4 years ago

I'm noticing this even with only a single job on a node:

Slurm Web 2.2.2: [image]

Slurm Web 2.2.5: [image]

There are 3 changes that are the potential culprits: [image]

BSCrumpton commented 4 years ago

The error was introduced by the major code changes between https://github.com/edf-hpc/slurm-web/commit/c233323e514fb41e9d2490a69212378130c0712b and https://github.com/edf-hpc/slurm-web/commit/c1d1ad2f41efc365ad3bfea6fd380c15c3033d8d. Reverting only the 2d-draw.js file back to https://github.com/edf-hpc/slurm-web/blob/86ac1c6d1b3873de80320ec06bc2fed6bdb26e0c/dashboard/js/draw/2d-draw.js gets it working mostly like normal, but there are still some errors with the layout.

BSCrumpton commented 4 years ago

I think I've tracked an issue in 2.2.5 down to the getCoreABSCoordinates function. I noticed that at certain times multiple cores map to the same X,Y pair. I'm looking at doing a fix sometime soon, if I can figure it out.
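
For anyone who wants to check for this kind of overlap, here is a minimal sketch of a collision detector. It does not reproduce the real getCoreABSCoordinates; `coreGridPosition` and `findCoordinateCollisions` are hypothetical names, and the simple row-major grid is only an assumption used to show a mapping in which no two cores share a position.

```javascript
// Hypothetical helpers: none of these names come from Slurm-web itself.

// Maps a core index to a unique cell in a grid with `columns` columns
// (row-major order), so no two cores can share a position.
function coreGridPosition(coreIndex, columns) {
  return {
    x: coreIndex % columns,
    y: Math.floor(coreIndex / columns),
  };
}

// Returns the groups of core indices that share the same (x, y) position
// under the given mapping. An empty array means there is no overlap.
function findCoordinateCollisions(coreCount, positionOf) {
  const seen = new Map();
  for (let core = 0; core < coreCount; core++) {
    const { x, y } = positionOf(core);
    const key = `${x},${y}`;
    if (!seen.has(key)) seen.set(key, []);
    seen.get(key).push(core);
  }
  return [...seen.values()].filter((cores) => cores.length > 1);
}
```

Passing the mapping as a function means the same check could, in principle, be pointed at whatever coordinate function the dashboard actually uses, provided it is wrapped to return an object with `x` and `y`.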

BSCrumpton commented 4 years ago

So, 2 major issues I've found:

  1. In the /utils/jobs.js buildAllocatedCPUs function, you will only end up getting one layout returned for a node, regardless of how many jobs (and therefore how many layouts) should be returned. This is because it overwrites the value for the 'layout' key on every job. When this gets fixed, it will necessitate an update of the drawCores function in /draw/2d-draw.js, as it is only expecting a single layout.
  2. Layouts contain only half of the allocated cores. This is most likely due to hyper-threading. As an example, for 2 jobs allocated to a node, one has layout [2, 3, 4, 5, 6, 7, 8] with 14 allocated cores and the other has layout [0, 1] with 4 allocated cores. In this case, I think it actually makes sense to revert to doing drawCores by the allocatedCPUs instead of the layout. Note that you could do the layout drawing, but it would require working a little magic to properly show the cores used.
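
The two points above can be sketched roughly as follows. This is not the actual buildAllocatedCPUs or drawCores code: the input shape, the node name `cn01`, the job IDs, and the function names `collectAllocationsByNode` and `assignCoresByCount` are all assumptions made for illustration. Only the layouts and CPU counts are taken from the example above.

```javascript
// Assumed input shape (hypothetical): each job lists, per node name,
// the number of allocated CPUs and the core layout reported for it.
const jobs = {
  101: { nodes: { cn01: { cpus: 14, layout: [2, 3, 4, 5, 6, 7, 8] } } },
  102: { nodes: { cn01: { cpus: 4, layout: [0, 1] } } },
};

// Issue 1: keep one entry per job for every node, instead of a single
// 'layout' value that each job overwrites in turn.
function collectAllocationsByNode(jobs) {
  const byNode = {};
  for (const [jobId, job] of Object.entries(jobs)) {
    for (const [nodeName, alloc] of Object.entries(job.nodes)) {
      if (!byNode[nodeName]) byNode[nodeName] = [];
      byNode[nodeName].push({ jobId, cpus: alloc.cpus, layout: alloc.layout });
    }
  }
  return byNode;
}

// Issue 2: ignore the layout and assign core squares to jobs purely by
// allocated CPU count. Returns one owning job ID per drawn core.
function assignCoresByCount(allocations) {
  const owners = [];
  for (const { jobId, cpus } of allocations) {
    for (let i = 0; i < cpus; i++) owners.push(jobId);
  }
  return owners;
}

const owners = assignCoresByCount(collectAllocationsByNode(jobs).cn01);
console.log(owners.length); // 18 squares: 14 for the first job, 4 for the second
```

Drawing by count loses the information about which physical cores a job occupies, but it keeps the number of colored squares per job consistent with the allocation.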

bviviano commented 3 years ago

Just checked 2.2.6 and it has the same issue w/Slurm 19.05.8. Any idea if there is a way to fix it?

BSCrumpton commented 3 years ago

I ended up rewriting the functions to match the old functionality, partially to add support for a GPUs page. Some of the relevant code can be seen here: https://github.com/BSCrumpton/slurm-web/tree/GPUBranch

bviviano commented 3 years ago

So just replace jobs.js and 2d-draw.js with the ones from your repo, or do you have a README someplace with additional instructions?

Thanks.

BSCrumpton commented 3 years ago

Honestly, maybe just dashboard/js/draw/2d-draw.js. Note that I haven't tested this in a while, so I'm not 100% sure. No other README, but I should add it to the docket in the future :joy:

bviviano commented 3 years ago

I replaced the draw/2d-draw.js from the tagged 2.2.6 branch with the one from your repo and it's now drawing correctly. I think the 3d draw might still be off, but no one really uses that, except for a demo, and then no one cares about the cores.

Any pointers as to what the GPU changes you made do and how to incorporate them?

Edit to add: you do need utils/jobs.js as well, or the node count drawing gets off due to a math error.

BSCrumpton commented 3 years ago

[image]

Basically, I added another tab to the main menu (top right) called GPUs that displays similarly to JobsMap, but showing GPUs instead of cores.

Additionally, in the main Jobs tab, [image] resources now show GPUs. Note that this functionality is entirely dependent on your Slurm/pyslurm version. I'm using the TRES fields to get the number of free and allocated GPUs, and older Slurm versions don't support that field at all.
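
As a rough illustration of reading GPU counts from a TRES field, here is a minimal sketch. It assumes the value arrives as a comma-separated string of `name=value` pairs (for example `cpu=14,mem=56G,node=1,gres/gpu=2`) and that the total GPU count sits under the `gres/gpu` key. The function names `parseTres` and `gpuCount` are hypothetical and are not taken from the GPUBranch code.

```javascript
// Hypothetical helpers; the string format below is an assumption about
// how the TRES field is exposed to the dashboard.

// Splits a TRES string such as "cpu=14,mem=56G,gres/gpu=2" into an
// object of name -> raw string value.
function parseTres(tresString) {
  const result = {};
  if (!tresString) return result;
  for (const part of tresString.split(",")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip malformed entries
    result[part.slice(0, eq).trim()] = part.slice(eq + 1).trim();
  }
  return result;
}

// Returns the GPU count from the generic "gres/gpu" entry, or 0 when the
// field is missing (for example on versions that do not report TRES).
function gpuCount(tresString) {
  const count = parseInt(parseTres(tresString)["gres/gpu"], 10);
  return Number.isNaN(count) ? 0 : count;
}
```

Falling back to 0 keeps the page usable on clusters where the field is absent, at the cost of not distinguishing "no GPUs" from "not reported".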

bviviano commented 3 years ago

Thanks for the screenshot, that makes it clear what you're doing. I only have 4 GPGPUs on my cluster, across 200 nodes / 5 racks, so it wouldn't really matter too much to my users right now, but it's a nice extension.

SullivanJia commented 2 years ago

There are two problems, in dashboard/js/draw/2d-draw.js and dashboard/js/utils/jobs.js, and I fixed them like this: [image]. Later I will push my local code to solve this problem.

rezib commented 4 months ago

This issue concerns Slurm-web v2 which is not maintained anymore. You are highly encouraged to test the new version v3.0.0. The quick start guide for v3.0.0 is available online: https://docs.rackslab.io/slurm-web/install/quickstart.html

Unless someone is motivated to maintain the old version of Slurm-web or you have a justified reason to keep this issue open, it will be closed in a few weeks.

rezib commented 3 months ago

For the reasons explained in the previous comment, I am now closing this issue.