rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

Install Fails: go install fails with error in utils.go #71

Closed codeknight03 closed 1 week ago

codeknight03 commented 2 weeks ago

When installing this via go install on an Ubuntu-based EC2 VM, I get the following error:

$ go install github.com/rivosinc/prometheus-slurm-exporter@v1.5.1
# github.com/rivosinc/prometheus-slurm-exporter/exporter
go/pkg/mod/github.com/rivosinc/prometheus-slurm-exporter@v1.5.1/exporter/utils.go:175:8: cannot use 1e+12 (untyped float constant) as int value in map literal (truncated)

Here is the go version and other details about my environment

$ uname -r
5.19.0-45-generic
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:    22.04
Codename:   jammy
$ go version
go version go1.22.0 linux/amd64

codeknight03 commented 2 weeks ago

Interestingly it works on my local WSL setup. Here are the details for my local setup

$ go version
go version go1.22.0 linux/amd64
$ uname -r
5.15.153.1-microsoft-standard-WSL2
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

The only difference I see is the kernel, and I am not sure whether that matters here.

abhinavDhulipala commented 2 weeks ago

The only thing I can think of is that perhaps it's using int32 instead of int64 by default?

>>> 2**32 > 1e12
False
>>> 2**64 > 1e12
True

I will push a branch with float64 typing and see if that helps :)
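For reference, here is a minimal sketch of the kind of typing fix being discussed. The map name and its entries are hypothetical stand-ins, not the actual contents of exporter/utils.go. The point is only that the constant 1e12 fits in a 64-bit integer but not in a 32-bit one, so giving the map an explicit int64 (or float64) value type removes the dependence on the width of plain int.

```go
package main

import (
	"fmt"
	"strconv"
)

// unitMultipliers is a hypothetical stand-in for the map literal in
// exporter/utils.go; the real keys and values may differ.
// With a plain int value type, the constant 1e12 only compiles where int is
// 64 bits wide. An explicit int64 compiles on every platform.
var unitMultipliers = map[string]int64{
	"K": 1e3,
	"M": 1e6,
	"G": 1e9,
	"T": 1e12, // too large for a 32-bit int, fits in int64
}

func main() {
	// strconv.IntSize reports the width of the plain int type for this build.
	fmt.Println("int width in bits:", strconv.IntSize)
	fmt.Println("T multiplier:", unitMultipliers["T"])
}
```

Running this on the affected machine would also show which int width the toolchain there is actually building for.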

abhinavDhulipala commented 2 weeks ago

Please let me know if the following branch works

codeknight03 commented 2 weeks ago

The installation works, but now the exporter does not:

root@rgEval:~# prometheus-slurm-exporter prometheus-slurm-exporter -slurm.cli-fallback
time=2024-06-20T10:49:03.466+02:00 level=INFO msg="serving metrics at :9092/metrics"
time=2024-06-20T11:07:01.426+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:41:22.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:41:37.333+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:41:52.334+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:42:07.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:42:22.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:42:37.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:42:52.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:43:07.334+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:43:22.335+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:43:37.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:43:52.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:44:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:44:22.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:44:37.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:44:52.335+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:45:07.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:45:22.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:45:37.333+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:45:52.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:46:07.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:46:22.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:46:37.338+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:46:52.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:47:07.336+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:47:22.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:47:37.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:47:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:48:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:48:22.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:48:37.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:48:52.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:49:07.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:49:22.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:49:37.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:49:52.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:50:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:50:22.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:50:37.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:50:52.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:51:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:51:22.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:51:37.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:51:52.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:52:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:52:22.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:52:37.341+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:52:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:53:07.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:53:22.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:53:37.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:53:52.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:54:07.337+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:54:22.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:54:37.341+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:54:52.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:55:07.333+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:55:22.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:55:37.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:55:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:56:07.331+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:56:22.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:56:37.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:56:52.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:57:07.331+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:57:22.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:57:37.331+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:57:52.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:58:07.332+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:58:22.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:58:37.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:58:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:59:07.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:59:22.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:59:37.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T11:59:52.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:00:07.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:00:22.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:00:37.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:00:52.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:01:07.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:01:22.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:01:37.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:01:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:02:07.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:02:22.338+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:02:37.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:02:52.333+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:03:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:03:22.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:03:37.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:03:52.338+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:04:07.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:04:22.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:04:37.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:04:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:05:07.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:05:22.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:05:37.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:05:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:06:07.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:06:22.343+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:06:37.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:06:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:07:07.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:07:22.341+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:07:37.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:07:52.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:08:07.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:08:22.343+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:08:37.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:08:52.327+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:09:07.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:09:22.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:09:37.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:09:52.333+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:10:07.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:10:22.353+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:10:37.329+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:10:52.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:11:07.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:11:22.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:11:37.336+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:11:52.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:12:07.338+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:12:22.324+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:12:37.332+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:12:52.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:13:07.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:13:22.337+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:13:37.331+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:13:52.336+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:14:07.323+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:14:22.322+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:14:37.326+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:14:52.331+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:15:07.325+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:15:22.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:15:37.330+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 ``"
time=2024-06-20T12:15:52.328+02:00 level=ERROR msg="squeue fallback parse error: failed on line 0 `

The slurm version

root@rgEval:/etc/alloy# squeue --version
slurm 23.02.7
root@rgEval:/etc/alloy#

abhinavDhulipala commented 2 weeks ago

Well, we are getting somewhere. Could you tell me whether the following command produces usable output, and share some example output (with no proprietary data, of course)? I suspect that squeue is printing either no output or an extra empty line. Perhaps one of the CLI output options behaves differently on that version of Slurm.

squeue --states=all -h -r -o '{"a": "%a", "id": %A, "end_time": "%e", "u": "%u", "state": "%T", "p": "%P", "cpu": %C, "mem": "%m", "array_id": "%K"}'

Also, are you getting any job stats at all? The exporter logs each line that fails to parse but continues processing the rest of the output.
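To illustrate the hypothesis, here is a rough sketch of a line-by-line parser for the JSON-per-line format above. The struct, the function name, and the sample job values are all made up for illustration; this is not the exporter's actual code. It shows how completely empty squeue output still splits into one empty "line 0", which would fail to parse, and how skipping blank lines avoids that.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// squeueJob mirrors the keys of the format string above.
// The type and field names are hypothetical.
type squeueJob struct {
	Account   string `json:"a"`
	ID        int64  `json:"id"`
	EndTime   string `json:"end_time"`
	User      string `json:"u"`
	State     string `json:"state"`
	Partition string `json:"p"`
	CPUs      int64  `json:"cpu"`
	Mem       string `json:"mem"`
	ArrayID   string `json:"array_id"`
}

// parseSqueue parses one JSON object per line. A line that fails to parse is
// recorded as an error and the loop moves on to the next line.
func parseSqueue(output string, skipBlank bool) ([]squeueJob, []error) {
	var jobs []squeueJob
	var errs []error
	for i, line := range strings.Split(output, "\n") {
		if skipBlank && strings.TrimSpace(line) == "" {
			continue // an empty queue or a trailing newline is not an error
		}
		var job squeueJob
		if err := json.Unmarshal([]byte(line), &job); err != nil {
			errs = append(errs, fmt.Errorf("failed on line %d `%s`", i, line))
			continue
		}
		jobs = append(jobs, job)
	}
	return jobs, errs
}

func main() {
	// Empty output still yields a single empty line when split on "\n".
	_, errs := parseSqueue("", false)
	fmt.Println("errors without blank-line skipping:", errs)

	// Made-up sample line with a trailing newline.
	sample := `{"a": "root", "id": 42, "end_time": "N/A", "u": "root", "state": "PENDING", "p": "debug", "cpu": 1, "mem": "0", "array_id": "N/A"}` + "\n"
	jobs, errs := parseSqueue(sample, true)
	fmt.Println("jobs:", len(jobs), "errors:", len(errs))
}
```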

codeknight03 commented 2 weeks ago

This command does not work for me. Which version of Slurm have you tested this on?

squeue --states=all -h -r -o {"a": "%a", "id": %A, "end_time": "%e", "u": "%u", "state": "%T", "p": "%P", "cpu": %C, "mem": "%m", "array_id": "%K"}
squeue: error: Unrecognized option: %a,
Usage: squeue [-A account] [--clusters names] [-i seconds] [--job jobid]
              [-n name] [-o format] [-p partitions] [--qos qos]
              [--reservation reservation] [--sort fields] [--start]
              [--step step_id] [-t states] [-u user_name] [--usage]
              [-L licenses] [-w nodes] [--federation] [--local] [--sibling]
          [-ahjlrsv]

While attempting to debug, I changed the command a little. This version works and does print output:

squeue --states=all -h -r -o "%a %a %e %u %t %p  %c %m %k"
root root N/A root PD 0.99998472956940  1 0 (null)

codeknight03 commented 2 weeks ago

We are getting the cluster-related metrics, such as node data (mostly the sinfo information), but no job data. We haven't enabled tracing on our jobs, though.

abhinavDhulipala commented 2 weeks ago

Howdy, the version of slurm we have is: slurm 23.02.4. I checked the docs for 23.02.4 and 23.02.7 and found no changes to the squeue output format. For the problem above, do you mind trying the following?

  1. Put the format statement in single quotes.
  2. Try using a bash/zsh shell.
  3. Modify the statement and pass it to the -slurm.squeue-cli option. Note that the option splits on the space character, so you'd have to do something like:
prometheus-slurm-exporter -slurm.squeue-cli squeue <modified cmd>

Or you could make a small squeue_fetch script with the command that works for your machine and do the following:

#!/bin/bash

set -e

# in squeue_wrapper.sh
# keep modifying the statement below until the output works
squeue --states=all -h -r -o '{"a": "%a", "id": %A, "end_time": "%e", "u": "%u", "state": "%T", "p": "%P", "cpu": %C, "mem": "%m", "array_id": "%K"}'

Then invoke it in the exporter:

prometheus-slurm-exporter -slurm.squeue-cli ./squeue_wrapper.sh
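As a rough illustration of why the wrapper script sidesteps the quoting problem: assuming the option simply splits its value on single spaces, as noted above, a format string containing spaces is broken into several arguments and its quote characters are passed through literally, while a script path stays a single token. The helper name below is hypothetical and the shortened format string is only an example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSpaces imitates the splitting behaviour described above.
// It is an assumption about the option, not the exporter's actual code.
func splitOnSpaces(cli string) []string {
	return strings.Split(cli, " ")
}

func main() {
	// A shortened version of the inline format string. No shell is involved,
	// so the single quotes are not interpreted and the string is cut at
	// every space.
	inline := `squeue -h -o '{"a": "%a", "id": %A}'`
	for i, arg := range splitOnSpaces(inline) {
		fmt.Printf("%d: %s\n", i, arg)
	}

	// A wrapper script path contains no spaces, so it stays one argument.
	fmt.Println(len(splitOnSpaces("./squeue_wrapper.sh")))
}
```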

Unfortunately, if you are using slurm 23, trying to use the dataparser json option might not work for you, but you could give it a shot (-slurm.cli-fallback=false) or contribute to that fetcher if you'd like. I started work on that here.

What's more puzzling is that you should get the following log message if the fetcher fails as you've described above:

time=2024-06-20T12:19:21.446-07:00 level=ERROR msg="fetcher failure %q" !BADKEY="exit status 1"

So that's even more confusing.

Also, is your queue empty? If so, the error messages are harmless, but I could fix them.

abhinavDhulipala commented 2 weeks ago

@codeknight03 I think I reproduced the problem and patched it here. I don't think you had any jobs running. That case should now be handled. Feel free to test it out again with:

go install github.com/rivosinc/prometheus-slurm-exporter@util-int64

abhinavDhulipala commented 1 week ago

I'm going to merge the MR with the type fix in. If there are any other job problems, we can handle them in a separate MR.

codeknight03 commented 1 week ago

Thanks for the quick turnaround, @abhinavDhulipala. I also regularly have a few free hours to take up some HPC challenges. Let me know how I can help. I am interested in implementing REST support; let me know a good place to start.