We run cgroup_exporter on several compute nodes, and sometimes we get this error message:
# curl localhost:9306/metrics
An error has occurred while serving metrics:
91 error(s) occurred:
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_user_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:395.08}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_system_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:16.58}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpu_total_seconds" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:411.858161192}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_cpus" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:64}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_rss_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_cache_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memory_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_used_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_total_bytes" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_memsw_fail_count" { label:{name:"jobid" value:""} label:{name:"step" value:""} label:{name:"task" value:""} gauge:{value:0}} was collected before with the same name and label values
* [from Gatherer #1] collected metric "cgroup_uid" { label:{name:"jobid" value:""} gauge:{value:0}} was collected before with the same name and label values
...
I restarted the cgroup_exporter and slurmd services, but that did not solve the problem. The issue went away only after I rebooted the whole compute node. Do you have any idea what could cause this?