Open MK4H opened 3 years ago
On our current infrastructure it is not possible to reproduce this problem, because our hostnames are shorter than 10 characters.
On Slurm 18.08.8 (CentOS 7.8), I get quite different output.
node.go
I got the following (shortened for clarity):
[...]
node1032 main 189200 191762 86/10/0/96 mixed
node1033 main 189200 191762 86/10/0/96 mixed
[...]
node1032main18920019176286/10/0/96mixed
node1033main18920019176286/10/0/96mixed
The second output will most likely crash node.go, considering that to extract the numbers we assume that some spaces are present between each value (node name, partition, etc.).
Which version of Slurm are you using? From the crash log you have posted and the hostnames you are using, I assume you are running a cluster on AWS, but we do not have operational experience with that environment.
We are using AWS ParallelCluster 2.10.0, running on Ubuntu 18.04 with Slurm version 20.02.4. With this version, the output of sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: " is:
cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle
cpu-always-on-st-t3amedium-2 0 1 0/2/0/2 idle
cpu-always-on-st-t3amedium-3 0 1 0/2/0/2 idle
cpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~
cpu-spot-dy-c52xlarge-2 0 1 0/8/0/8 idle~
So I guess there was a change between 18.08.8 and 20.02.4 that changes the interface and output of sinfo.
Edit: Found it: https://github.com/SchedMD/slurm/commit/9ea6c9468b763dd742f81a4e1ab43d47f0950501
We have faced this problem in the past. Since this exporter is basically parsing the output of sinfo, squeue, sdiag, etc., it is very sensitive to the output format, which the SchedMD developers change from time to time.
The node.sh module is particularly prone to this problem. In the past, I have implemented a workaround using regular expressions (e.g. nodes.go), but so far I have not had the chance to do the same with node.sh.
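To illustrate what a regular-expression approach could look like here, the following is a minimal sketch, not the exporter's actual code. It anchors on the CPUsState token (which sinfo documents as allocated/idle/other/total) and the trailing state, so those values can still be read when the node name has fused with the column after it. The names cpuStateRe, cpuInfo and parseCPUState are hypothetical. Note that a regex cannot separate digits that have been fused together with no delimiter at all, so such lines are rejected rather than guessed at.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// cpuStateRe (hypothetical name) matches a whitespace-preceded
// "allocated/idle/other/total" token followed by the node state
// at the end of the line.
var cpuStateRe = regexp.MustCompile(`\s(\d+)/(\d+)/(\d+)/(\d+)\s+(\S+)\s*$`)

// cpuInfo holds the values recovered from one sinfo line.
type cpuInfo struct {
	Alloc, Idle, Other, Total int
	State                     string
}

// parseCPUState returns the CPU counts and state from a line, or false
// if the line does not end with the expected pattern.
func parseCPUState(line string) (cpuInfo, bool) {
	m := cpuStateRe.FindStringSubmatch(line)
	if m == nil {
		return cpuInfo{}, false
	}
	var n [4]int
	for i := range n {
		v, err := strconv.Atoi(m[i+1])
		if err != nil {
			return cpuInfo{}, false
		}
		n[i] = v
	}
	return cpuInfo{Alloc: n[0], Idle: n[1], Other: n[2], Total: n[3], State: m[5]}, true
}

func main() {
	lines := []string{
		"cpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~",
		"cpu-always-on-st-t3amedium-10 1 0/2/0/2 idle", // hypothetical: name fused with AllocMem
		"node1032main18920019176286/10/0/96mixed",      // fully fused: rejected
	}
	for _, l := range lines {
		info, ok := parseCPUState(l)
		fmt.Println(ok, info)
	}
}
```

The trade-off is that this recovers only the columns the pattern anchors on; the fused node name and memory values in the second sample line remain ambiguous.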
Recent versions of sinfo have --json, although that won't be an option for everyone. Does this need fields which are not supported by -o|--format? The documentation for that option says that if a size isn't provided, it automatically picks one long enough, as opposed to -O|--Format, which defaults to 20 characters if no size is provided.
If node names are over 20 characters long, the output of sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong", used at node.go:85, looks like this:
You can see that the node name and memory are not separated by whitespace.
This results in a crash with the following output:
It expects 5 fields separated by whitespace but finds only 4, which results in an out-of-bounds array access and a panic.
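The failure mode can be sketched as follows. This is a minimal illustration rather than the code in node.go: parseNodeLine is a hypothetical helper, and the second sample line is a made-up example of a node name fused with the AllocMem column. Checking the field count before indexing turns the panic into an error that the caller can log and skip.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeLine (hypothetical helper) splits one line of sinfo output on
// whitespace and returns an error instead of panicking when the number
// of fields is not the five requested via -O.
func parseNodeLine(line string) ([]string, error) {
	const wantFields = 5
	fields := strings.Fields(line)
	if len(fields) != wantFields {
		return nil, fmt.Errorf("expected %d fields, got %d in %q", wantFields, len(fields), line)
	}
	return fields, nil
}

func main() {
	lines := []string{
		"cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle",
		"cpu-always-on-st-t3amedium-10 1 0/2/0/2 idle", // hypothetical: name fused with AllocMem
	}
	for _, l := range lines {
		fields, err := parseNodeLine(l)
		if err != nil {
			fmt.Println("skipping line:", err)
			continue
		}
		// Safe to index: the length was verified above.
		fmt.Println("node:", fields[0], "state:", fields[4])
	}
}
```

A guard like this only prevents the crash; the fused line is still unparseable, so the output format itself also needs fixing.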
A possible fix is to change
sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
to
sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: "
explicitly telling SLURM to append a space after each value.
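As a small sketch of how the proposed format string could be assembled in Go instead of being hard-coded: buildFormat is a hypothetical helper, and the ": " suffix syntax is taken from the command proposed in this issue. Other comments in the thread suggest its behavior differs between Slurm versions, so treat this as an assumption to verify against the installed sinfo.

```go
package main

import (
	"fmt"
	"strings"
)

// buildFormat (hypothetical helper) joins sinfo field names into an -O
// argument, appending ": " to each so that no explicit size is set and a
// single space follows every value.
func buildFormat(fields []string) string {
	parts := make([]string, len(fields))
	for i, f := range fields {
		parts[i] = f + ": "
	}
	return strings.Join(parts, ",")
}

func main() {
	format := buildFormat([]string{"NodeList", "AllocMem", "Memory", "CPUsState", "StateLong"})
	// Arguments as they could be passed to exec.Command("sinfo", args...).
	args := []string{"-h", "-N", "-O", format}
	fmt.Printf("sinfo %q\n", args)
}
```

Keeping the field list in one slice also makes it easy to derive the expected field count for the parser from the same source.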