rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

Job metrics are missing #78

Closed: xpillons closed this issue 2 months ago

xpillons commented 3 months ago

I'm running Slurm 23.11.7. I'm on branch util-int64 because I'm hitting the same issue as in #71.

I don't know why I'm not getting any job metrics here. I used the squeue_wrapper, which runs fine.

$ curl -s http://10.120.0.132:9092/metrics | grep "# HELP slurm_job"
# HELP slurm_job_scrape_duration how long the cmd [./squeue_wrapper.sh] took (ms)

$ curl -s http://10.120.0.132:9092/metrics | grep "# HELP slurm_partition"
# HELP slurm_partition_alloc_mem Alloc mem per partition
# HELP slurm_partition_free_mem Free mem per partition
# HELP slurm_partition_idle_cpus Idle cpus per partition
# HELP slurm_partition_real_mem Real mem per partition
# HELP slurm_partition_total_cpus Total cpus per partition
# HELP slurm_partition_weight Total node weight per partition??
# ./squeue_wrapper.sh
{"a": "(null)", "id": 6, "end_time": "2024-06-21T14:16:58", "u": "xpillons", "state": "COMPLETED", "p": "hpc", "cpu": 480, "mem": "3696M", "array_id": "N/A"}
codeknight03 commented 3 months ago

I am facing the same thing. I thought we could only get job metrics if we enable tracing and have tracing running for each job.

xpillons commented 3 months ago

The number of jobs per state per partition shouldn't require job tracing to be enabled; that doesn't make sense to me.

abhinavDhulipala commented 3 months ago

You shouldn't need to enable tracing to get job metrics. @xpillons Are you getting any errors or noteworthy logs? I'll take your output and plumb it through the exporter to see if I can reproduce it.

abhinavDhulipala commented 3 months ago

@xpillons I tried to repro your bug. I used your output and created the following test case in exporter/jobs_test.go:

func TestParseCliFallback_Issue77(t *testing.T) {
    assert := assert.New(t)
    // Feed the squeue output from this issue through the CLI fallback parser.
    fetcher := MockScraper{fixture: "fixtures/bug.txt"}
    data, err := fetcher.FetchRawBytes()
    assert.Nil(err)
    counter := prometheus.NewCounter(prometheus.CounterOpts{Name: "errors"})
    metrics, err := parseCliFallback(data, counter)
    assert.Nil(err)
    // Expect exactly one parsed job and no parse errors counted.
    assert.Len(metrics, 1)
    assert.Equal(0., CollectCounterValue(counter))
}
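
For anyone wanting to reproduce this locally, the test can be run on its own from the repo root (assuming the standard Go toolchain; the path above suggests it lives in the exporter package):

# Run only this test case, with verbose output
go test ./exporter -run TestParseCliFallback_Issue77 -v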

And it passed. I then booted a single-core mock cluster on the following machine (my codespace):

Specs:

uname -a
Linux codespaces-ee9e26 6.5.0-1021-azure #22~22.04.1-Ubuntu SMP Tue Apr 30 16:08:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

When the queue is empty I get the following using your grep:

curl -s localhost:9093/metrics | grep "# HELP slurm_job"
# HELP slurm_job_scrape_duration how long the cmd [squeue --states=all -h -r -o {"a": "%a", "id": %A, "end_time": "%e", "u": "%u", "state": "%T", "p": "%P", "cpu": %C, "mem": "%m", "array_id": "%K"}] took (ms)

But that doesn't actually yield all the job stats. Try a grep like this:

curl -s localhost:9093/metrics | grep "slurm_.*_alloc"
# HELP slurm_account_cpu_alloc alloc cpu consumed per account
# TYPE slurm_account_cpu_alloc gauge
slurm_account_cpu_alloc{account="(null)"} 3
# HELP slurm_account_mem_alloc alloc mem consumed per account
# TYPE slurm_account_mem_alloc gauge
slurm_account_mem_alloc{account="(null)"} 0
# HELP slurm_feature_cpu_alloc alloc cpu consumed per feature
# TYPE slurm_feature_cpu_alloc gauge
slurm_feature_cpu_alloc{feature=""} 3
# HELP slurm_mem_alloc Total alloc mem
# TYPE slurm_mem_alloc gauge
slurm_mem_alloc -2.64e+08
# HELP slurm_partition_alloc_cpus Alloc cpus per partition
# TYPE slurm_partition_alloc_cpus gauge
slurm_partition_alloc_cpus{partition="debug",state="allocated"} 1
# HELP slurm_user_cpu_alloc total cpu alloc per user
# TYPE slurm_user_cpu_alloc gauge
slurm_user_cpu_alloc{state="COMPLETED",username="root"} 1
slurm_user_cpu_alloc{state="PENDING",username="root"} 1
slurm_user_cpu_alloc{state="RUNNING",username="root"} 1

With that grep I get those stats, which seems to indicate normal operation.
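
Job-state counts per partition are exported as slurm_partition_job_state_total; to check those series directly (rather than just the # HELP lines), a grep along these lines should show one series per partition/state combination while jobs are in the queue:

curl -s localhost:9093/metrics | grep "^slurm_partition_job_state_total"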

xpillons commented 3 months ago

I've updated to 1.5.2 and I'm still having job reporting issues. These are the metrics for allocated CPUs, not for jobs. Here I have a single job submitted across 4 nodes.

[screenshot: allocated CPU metrics]

And I don't see anything in the dashboard for allocated nodes per partition. [screenshot]

In the variable list at the top, Partition is filled, but jobid, Job State, User, and Account are all empty. [screenshot]

The metrics are reported; they are just not shown in Grafana.

# curl -s http://ccsw-scheduler:9092/metrics | grep "slurm_partition_job"
# HELP slurm_partition_job_state_total total jobs per partition per state
# TYPE slurm_partition_job_state_total gauge
slurm_partition_job_state_total{partition="hpc",state="RUNNING"} 1
abhinavDhulipala commented 3 months ago

I see. Does the following query yield a result at all?

sum by(partition) (slurm_partition_alloc_cpus) / sum by(partition) (slurm_partition_total_cpus)

If you could paste the queries you do have, that'd be great. I think the dashboard might be out of date; I'm trying to create an updated one with the updated metrics.
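
As a quick way to check that query outside Grafana, it can be sent straight to the Prometheus HTTP API (the Prometheus host and port below are placeholders for the actual setup):

curl -s --data-urlencode 'query=sum by(partition) (slurm_partition_alloc_cpus) / sum by(partition) (slurm_partition_total_cpus)' http://<prometheus-host>:9090/api/v1/query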

xpillons commented 3 months ago

I'm using this dashboard. It would be great to provide an equivalent one, since you are porting the implementation from the old, unmaintained repo.

So if I run the query slurm_partition_job_state_total{instance="$instance"} I do get records; the panel is empty because there is no job_state label.

So there is data, but it's not displayed correctly in the dashboard.

abhinavDhulipala commented 2 months ago

I updated the dashboard a couple of weeks ago. Please pull the new link and try again.