vpenso / prometheus-slurm-exporter

Prometheus exporter for performance metrics from Slurm.
GNU General Public License v3.0

Crash when node names are too long #55

Open MK4H opened 3 years ago

MK4H commented 3 years ago

If node names are over 20 characters long, the output of sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong", used at node.go:85, looks like this:

cpu-always-on-st-t30                   1                   0/2/0/2             idle                
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               

You can see that the node name and the memory value are not separated by whitespace, because the name fills its entire 20-character column.

This results in a crash with the following output:

prometheus-slurm-exporter[5783]: panic: runtime error: index out of range [4] with length 4
prometheus-slurm-exporter[5783]: goroutine 9 [running]:
prometheus-slurm-exporter[5783]: main.ParseNodeMetrics(0xc00016e000, 0x5eb, 0xe00, 0xc0000b10d8)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:56 +0x6cf
prometheus-slurm-exporter[5783]: main.NodeGetMetrics(0x8b7f20)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:40 +0x2a
prometheus-slurm-exporter[5783]: main.(*NodeCollector).Collect(0xc00007a000, 0xc0000b1080)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:128 +0x37
prometheus-slurm-exporter[5783]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/client_golang@v1.2.1/prometheus/registry.go:443 +0x1a2
prometheus-slurm-exporter[5783]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/client_golang@v1.2.1/prometheus/registry.go:535 +0xe8e
systemd[1]: slurm_exporter.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: slurm_exporter.service: Failed with result 'exit-code'.

The parser expects five fields separated by whitespace but finds only four, which results in an out-of-bounds array access and a panic.

A possible fix is to change sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong" to sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ", explicitly telling Slurm to append a space after each value.
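Independently of the sinfo invocation, the exporter could also guard against malformed lines instead of indexing blindly. A minimal sketch in Go; the helper name parseNodeLine is hypothetical, not the exporter's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeLine splits one line of `sinfo -h -N -O ...` output into the
// five expected fields (node, allocated memory, total memory, CPU state,
// state). Instead of panicking when columns have run together -- as
// happens when a node name fills its fixed-width column -- it reports
// the line as unparseable.
func parseNodeLine(line string) ([]string, bool) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return nil, false
	}
	return fields, true
}

func main() {
	good := "cpu-spot-dy-1 0 1 0/8/0/8 idle~"
	// Name and AllocMem fused together: only four whitespace-separated fields.
	bad := "cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~"
	if f, ok := parseNodeLine(good); ok {
		fmt.Println("parsed:", f[0], f[4])
	}
	if _, ok := parseNodeLine(bad); !ok {
		fmt.Println("skipped malformed line")
	}
}
```

A caller could skip or log such lines rather than crashing the whole collector.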

mtds commented 3 years ago

On our current infrastructure this problem is not reproducible, since all our hostnames are shorter than 10 characters.

On Slurm 18.08.8 (CentOS 7.8), I have got quite a different output.

  1. If I use the command currently in node.go I got the following (shortened for clarity):
    [...]
    node1032            main              189200              191762              86/10/0/96          mixed
    node1033            main              189200              191762              86/10/0/96          mixed
  2. With the proposed modification to the syntax, I have got instead:
    [...]
    node1032main18920019176286/10/0/96mixed
    node1033main18920019176286/10/0/96mixed

The second output would most likely crash node.go, since the parser assumes whitespace between the values (node name, partition, etc.) when extracting the numbers.

Which version of Slurm are you using? From the crash log you have posted and the hostnames you are using, I assume you are running a cluster on AWS but we definitely do not have operational experience with that environment.

MK4H commented 3 years ago

We are using AWS ParallelCluster 2.10.0, running on Ubuntu 18.04 using Slurm version 20.02.4. With this version, the output of sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: " is

cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle 
cpu-always-on-st-t3amedium-2 0 1 0/2/0/2 idle 
cpu-always-on-st-t3amedium-3 0 1 0/2/0/2 idle 
cpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~ 
cpu-spot-dy-c52xlarge-2 0 1 0/8/0/8 idle~ 

So I guess there was a change between 18.08.8 and 20.02.4 that altered how sinfo interprets these format options and thus its output.

Edit: Found it: https://github.com/SchedMD/slurm/commit/9ea6c9468b763dd742f81a4e1ab43d47f0950501

mtds commented 3 years ago

We have faced this problem in the past. Since this exporter basically parses the output of sinfo, squeue, sdiag, etc., it is very sensitive to the output format, which the SchedMD developers change from time to time.

The node.go module is particularly prone to this problem. In the past I implemented a workaround using regular expressions (e.g. in nodes.go), but so far I have not had the chance to do the same for node.go.
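The regexp approach mentioned above could look roughly like this. The pattern and helper below are illustrative, not the code from nodes.go: the idea is to anchor on the well-structured A/I/O/T CPU quadruple and the state word, so the fields can be recovered even when the node-name column has overflowed into its neighbour. Note that it still cannot untangle the fully collapsed 18.08.8 output shown earlier, where no separators survive at all.

```go
package main

import (
	"fmt"
	"regexp"
)

// cpuStateRe matches the "allocated/idle/other/total" CPU quadruple
// followed by the node state, regardless of what happened to the
// columns before it. Illustrative pattern only.
var cpuStateRe = regexp.MustCompile(`(\d+)/(\d+)/(\d+)/(\d+)\s+(\S+)`)

// extractCPUState returns the four CPU counts and the state word from
// one sinfo output line, or false if the line is beyond recovery.
func extractCPUState(line string) ([]string, bool) {
	m := cpuStateRe.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	return m[1:], true
}

func main() {
	// Node name fused with AllocMem, but the CPU quadruple is intact.
	fused := "cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~"
	if vals, ok := extractCPUState(fused); ok {
		fmt.Println("cpus:", vals[0], vals[1], vals[2], vals[3], "state:", vals[4])
	}
	// Fully collapsed line: nothing left to anchor on.
	collapsed := "node1032main18920019176286/10/0/96mixed"
	if _, ok := extractCPUState(collapsed); !ok {
		fmt.Println("line is unrecoverable")
	}
}
```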

sjpb commented 2 years ago

Recent versions of sinfo have --json, although that won't be an option for everyone. Does this exporter need fields that are not supported by -o|--format? Its documentation says that if a size isn't provided it automatically picks one long enough, as opposed to -O|--Format, which defaults to 20 characters when no size is given.
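Building on the sizing behaviour described above, a possible stopgap is to widen the -O columns explicitly (the :&lt;size&gt; suffix is part of sinfo's --Format syntax). The widths below are arbitrary examples, and this only helps as long as node names actually fit:

```
sinfo -h -N -O "NodeList:50,AllocMem:12,Memory:12,CPUsState:20,StateLong:20"
```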