"An error has occurred while serving metrics:"

Atisom commented 4 days ago

Dear Team,

We run cgroup_exporter on several compute nodes, and sometimes we get this error message:

# curl localhost:9306/metrics
An error has occurred while serving metrics:

91 error(s) occurred:
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_user_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:395.08}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_system_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:16.58}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_total_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:411.858161192}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpus" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:64}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_rss_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_cache_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
...

# /opt/jobstats/cgroup_exporter --collect.fullslurm --config.paths /slurm
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:431 level=info msg="Starting cgroup_exporter" version="(version=, branch=, revision=64248e974a586d6fa75e0d1efc9e90c1b06785b8-modified)"
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:432 level=info msg="Build context" build_context="(go=go1.20.6, platform=linux/amd64, user=, date=, tags=unknown)"
ts=2024-11-15T15:00:26.607Z caller=cgroup_exporter.go:433 level=info msg="Starting Server" address=:9306
# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 (Ootpa)
# slurmd -V
slurm 22.05.7
# systemd-cgtop
Control Group                                                                                                                  Tasks   %CPU   Memory  Input/s Output/s
/                                                                                                                               2019 3265.5   102.9G        -        -
/slurm                                                                                                                             - 3176.5    91.2G        -        -
/slurm/uid_13***                                                                                                                   - 3176.5    71.4G        -        -
/slurm/uid_13***/job_87*****                                                                                                       - 3176.6    71.4G        -        -
/slurm/uid_13***                                                                                                                   -      -     6.2G        -        -
/slurm/uid_13***/job_87*****                                                                                                       -      -   103.5M        -        -
# ls /sys/fs/cgroup/cpuacct/slurm
cgroup.clone_children  cpuacct.stat   cpuacct.usage_all     cpuacct.usage_percpu_sys   cpuacct.usage_sys   cpu.cfs_period_us  cpu.rt_period_us   cpu.shares  notify_on_release  uid_13***
cgroup.procs           cpuacct.usage  cpuacct.usage_percpu  cpuacct.usage_percpu_user  cpuacct.usage_user  cpu.cfs_quota_us   cpu.rt_runtime_us  cpu.stat    tasks              uid_13***

I tried restarting the cgroup_exporter and slurmd services, but that didn't solve the problem. After I rebooted the whole compute node, the issue was resolved. Do you have any idea what could cause this?
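
For triage, the error report can be condensed to show which metric names and label sets are duplicated. The following is a minimal sketch, assuming the report has the line format shown in the curl output; `summarize_duplicates` and the `sample` text are illustrative names, not part of cgroup_exporter.

```python
import re
from collections import Counter

# Matches one line of the "was collected before" error report.
ERROR_LINE = re.compile(
    r'collected metric "(?P<name>[^"]+)" \{(?P<body>.*)\} was collected before'
)
# Matches one label entry inside the braces of a reported metric.
LABEL = re.compile(r'label:\{name:"(?P<key>[^"]*)" value:"(?P<value>[^"]*)"\}')


def summarize_duplicates(report: str) -> Counter:
    """Count duplicate-metric errors per (metric name, label set)."""
    counts = Counter()
    for line in report.splitlines():
        match = ERROR_LINE.search(line)
        if match is None:
            continue  # skip the header, blank lines, and the "..." elision
        labels = tuple(
            (m.group("key"), m.group("value"))
            for m in LABEL.finditer(match.group("body"))
        )
        counts[(match.group("name"), labels)] += 1
    return counts


# Shortened sample in the same format as the report above.
sample = '''91 error(s) occurred:
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpus" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:64}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
'''

summary = summarize_duplicates(sample)
for (name, labels), count in sorted(summary.items()):
    empty = [key for key, value in labels if value == ""]
    print(f"{name}: {count} duplicate(s), empty labels: {empty}")
```

In the report above, every listed metric carries empty jobid, step, and task values, so a summary like this mainly confirms that all the duplicates share the same empty label set.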

Atisom commented 3 days ago

When I remove the '--collect.fullslurm' flag, it works again. Maybe it cannot measure some kinds of jobs?
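
One way such a collision could arise is sketched below. It assumes that labels are derived from the cgroup path with a pattern and fall back to empty strings when a path does not match. This is not the exporter's or the fork's actual code: the pattern, the function names, and the example paths are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern for Slurm cgroup paths; step and task are optional.
# An illustrative assumption, not the exporter's real parsing logic.
SLURM_PATH = re.compile(
    r"^/slurm/uid_(?P<uid>\d+)/job_(?P<jobid>\d+)"
    r"(?:/step_(?P<step>[^/]+))?(?:/task_(?P<task>[^/]+))?$"
)


def labels_for(path: str) -> tuple:
    """Return (jobid, step, task); all fields are "" if the path does not match."""
    match = SLURM_PATH.match(path)
    if match is None:
        return ("", "", "")
    return tuple(match.group(key) or "" for key in ("jobid", "step", "task"))


def colliding_label_sets(paths: list) -> dict:
    """Map each label set produced by more than one path to its count."""
    counts = Counter(labels_for(path) for path in paths)
    return {labels: n for labels, n in counts.items() if n > 1}


# Made-up paths: the last two have no job component, so both fall back to
# the same all-empty label set.
paths = [
    "/slurm/uid_1000/job_42",
    "/slurm/uid_1000/job_42/step_0",
    "/slurm/uid_1000/job_42/step_0/task_0",
    "/slurm/uid_1000",
    "/slurm/uid_1001",
]
print(colliding_label_sets(paths))
```

Two metrics with the same name and identical label values are exactly what the "was collected before with the same name and label values" message reports, so any parsing fallback of this kind would fail the whole scrape as soon as two paths hit it.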

treydock commented 2 days ago

There is no --collect.fullslurm flag on this exporter. It looks like you may be running a version of this exporter that is a fork with extra functionality.

$ cgroup_exporter --help 2>&1 | grep collect
      --[no-]collect.proc       Boolean that sets if to collect proc information
      --collect.proc.max-exec=100
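
The same check can be scripted to compare the flags a given binary supports against the flags used to launch it. This is a minimal sketch that parses help text in the format shown above; `collect_flags` is a hypothetical helper, and the help text is pasted in by hand rather than read from the binary.

```python
import re

# Matches flag names in the form shown in the help output above,
# e.g. --collect.proc.max-exec=100 or --[no-]collect.proc.
FLAG = re.compile(r"--(?:\[no-\])?(?P<name>collect\.[A-Za-z0-9_.-]+)")


def collect_flags(help_text: str) -> set:
    """Return the set of collect.* flag names mentioned in the help text."""
    return {m.group("name") for m in FLAG.finditer(help_text)}


# Help output copied from the comment above.
help_text = """\
      --[no-]collect.proc       Boolean that sets if to collect proc information
      --collect.proc.max-exec=100
"""

flags = collect_flags(help_text)
print(sorted(flags))
print("collect.fullslurm supported:", "collect.fullslurm" in flags)
```

Running this against the help output of the binary actually deployed on the node is a quick way to tell which build, upstream or fork, is installed.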

Atisom commented 2 days ago

Oh, sorry about that. The README.md file in the fork repo (https://github.com/plazonic/cgroup_exporter) confused me a bit :)

treydock / cgroup_exporter

"An error has occurred while serving metrics:" #38