xpillons closed this issue 2 months ago
I am facing the same thing. I thought we could only get job metrics if we enabled tracing and had tracing running for each job.
The number of jobs per state per partition shouldn't require job tracing to be enabled; that doesn't make sense to me.
You shouldn't need to enable tracing to get job metrics. @xpillons Are you getting any errors or noteworthy logs? I'll take your output and plumb it through the exporter to see if I can reproduce it.
@xpillons I tried to repro your bug. I used your output and created the following test case in exporter/jobs_test.go:
func TestParseCliFallback_Issue77(t *testing.T) {
	assert := assert.New(t)
	fetcher := MockScraper{fixture: "fixtures/bug.txt"}
	data, err := fetcher.FetchRawBytes()
	assert.Nil(err)
	counter := prometheus.NewCounter(prometheus.CounterOpts{Name: "errors"})
	metrics, err := parseCliFallback(data, counter)
	assert.Nil(err)
	assert.Len(metrics, 1)
	assert.Equal(0., CollectCounterValue(counter))
}
And it passed. I then booted a single-core mock cluster on the following machine (my codespace):
Specs:
uname -a
Linux codespaces-ee9e26 6.5.0-1021-azure #22~22.04.1-Ubuntu SMP Tue Apr 30 16:08:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
When the queue is empty I get the following using your grep:
curl -s localhost:9093/metrics | grep "# HELP slurm_job"
# HELP slurm_job_scrape_duration how long the cmd [squeue --states=all -h -r -o {"a": "%a", "id": %A, "end_time": "%e", "u": "%u", "state": "%T", "p": "%P", "cpu": %C, "mem": "%m", "array_id": "%K"}] took (ms)
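For context, the squeue format string in that HELP line emits one JSON object per job line, which the exporter's CLI-fallback path then parses. A minimal stdlib-only sketch of parsing one such line (the struct, field names, and parseJobLine helper are illustrative, not the exporter's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobLine mirrors the keys in the squeue -o format string above.
// Struct name and field types are illustrative, not the exporter's own.
type jobLine struct {
	Account string `json:"a"`
	ID      int    `json:"id"`
	EndTime string `json:"end_time"`
	User    string `json:"u"`
	State   string `json:"state"`
	Part    string `json:"p"`
	CPUs    int    `json:"cpu"`
	Mem     string `json:"mem"`
	ArrayID string `json:"array_id"`
}

// parseJobLine decodes a single squeue output line into a jobLine.
func parseJobLine(line []byte) (jobLine, error) {
	var j jobLine
	err := json.Unmarshal(line, &j)
	return j, err
}

func main() {
	line := `{"a": "(null)", "id": 42, "end_time": "N/A", "u": "root", "state": "RUNNING", "p": "debug", "cpu": 1, "mem": "500M", "array_id": "N/A"}`
	j, err := parseJobLine([]byte(line))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s/%s: %d CPUs (%s)\n", j.Part, j.State, j.CPUs, j.User)
	// → debug/RUNNING: 1 CPUs (root)
}
```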
But that grep doesn't actually yield all the job stats. Try one like this:
curl -s localhost:9093/metrics | grep "slurm_.*_alloc"
# HELP slurm_account_cpu_alloc alloc cpu consumed per account
# TYPE slurm_account_cpu_alloc gauge
slurm_account_cpu_alloc{account="(null)"} 3
# HELP slurm_account_mem_alloc alloc mem consumed per account
# TYPE slurm_account_mem_alloc gauge
slurm_account_mem_alloc{account="(null)"} 0
# HELP slurm_feature_cpu_alloc alloc cpu consumed per feature
# TYPE slurm_feature_cpu_alloc gauge
slurm_feature_cpu_alloc{feature=""} 3
# HELP slurm_mem_alloc Total alloc mem
# TYPE slurm_mem_alloc gauge
slurm_mem_alloc -2.64e+08
# HELP slurm_partition_alloc_cpus Alloc cpus per partition
# TYPE slurm_partition_alloc_cpus gauge
slurm_partition_alloc_cpus{partition="debug",state="allocated"} 1
# HELP slurm_user_cpu_alloc total cpu alloc per user
# TYPE slurm_user_cpu_alloc gauge
slurm_user_cpu_alloc{state="COMPLETED",username="root"} 1
slurm_user_cpu_alloc{state="PENDING",username="root"} 1
slurm_user_cpu_alloc{state="RUNNING",username="root"} 1
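Conceptually, a gauge family like slurm_user_cpu_alloc above is just allocated CPUs summed per (state, user) pair across the parsed job list. A hedged stdlib-only sketch of that aggregation (the job type and userCPUAlloc helper are mine, not the exporter's):

```go
package main

import "fmt"

// job is a minimal illustrative record; the exporter's real structs differ.
type job struct {
	User, State string
	AllocCPUs   int
}

// userCPUAlloc sums allocated CPUs per (state, user) pair, the shape of
// the slurm_user_cpu_alloc series printed above.
func userCPUAlloc(jobs []job) map[[2]string]int {
	out := make(map[[2]string]int)
	for _, j := range jobs {
		out[[2]string{j.State, j.User}] += j.AllocCPUs
	}
	return out
}

func main() {
	jobs := []job{
		{User: "root", State: "RUNNING", AllocCPUs: 1},
		{User: "root", State: "PENDING", AllocCPUs: 1},
		{User: "root", State: "COMPLETED", AllocCPUs: 1},
	}
	// Map iteration order is unspecified, so series may print in any order.
	for k, v := range userCPUAlloc(jobs) {
		fmt.Printf("slurm_user_cpu_alloc{state=%q,username=%q} %d\n", k[0], k[1], v)
	}
}
```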
With that regex I get those stats, which seem to indicate normal operation.
I've updated to 1.5.2 and I'm still having job reporting issues. Those are the metrics for allocated CPUs, not for jobs. Here I have a single job submitted across 4 nodes,
and I don't see anything in the dashboard for allocated nodes per partition.
In the upper variable list, Partition is filled, but jobid, Job State, User, and Account are all empty.
The metrics are reported; they're just not shown in Grafana.
# curl -s http://ccsw-scheduler:9092/metrics | grep "slurm_partition_job"
# HELP slurm_partition_job_state_total total jobs per partition per state
# TYPE slurm_partition_job_state_total gauge
slurm_partition_job_state_total{partition="hpc",state="RUNNING"} 1
I see. Does the following query yield a result at all?
sum by(partition) (slurm_partition_alloc_cpus) / sum by(partition) (slurm_partition_total_cpus)
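That query just divides allocated by total CPUs per partition. The same arithmetic in plain Go, with made-up sample values (the utilization helper and numbers are purely illustrative):

```go
package main

import "fmt"

// utilization mirrors the PromQL above:
//   sum by(partition) (slurm_partition_alloc_cpus)
// / sum by(partition) (slurm_partition_total_cpus)
func utilization(alloc, total map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for p, a := range alloc {
		if t := total[p]; t > 0 {
			out[p] = a / t
		}
	}
	return out
}

func main() {
	// Hypothetical sample values; substitute your partitions' real numbers.
	alloc := map[string]float64{"debug": 1, "hpc": 4}
	total := map[string]float64{"debug": 4, "hpc": 192}
	for p, u := range utilization(alloc, total) {
		fmt.Printf("%s: %.1f%% allocated\n", p, 100*u)
	}
}
```

If this returns no series at all, the panel being empty would point at the metrics rather than the dashboard variables.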
If you could print the queries you do have, that'd be great. I think the dashboard might be out of date; I'm trying to create an updated one with updated metrics.
I'm using this dashboard. It would be great to provide the same as this one, since you are porting the implementation from the old, unmaintained repo.
so if I run this query slurm_partition_job_state_total{instance="$instance"}
I'm getting records, so the panel is empty because there is no job_state.
So there is data, but it's not reported correctly in the dashboard.
I updated the dashboard a couple weeks ago. Please pull the new link and try again.
I'm running Slurm 23.11.7. I'm on branch util-int64 since I'm hitting the same issue as in #71.
I don't know why I'm not getting any jobs here. I used the squeue_wrapper, which runs fine.