Closed: rbastiaans-tc closed this issue 1 year ago
Hi @rbastiaans-tc 👋
Thanks for the report. Unfortunately we will need more information in order to understand what's going wrong.
Looking at the code, this panic is raised by the Prometheus client library when the counter attempts to add a negative value. This library is used by go-metrics in the IncrCounter and IncrCounterWithLabels methods, but all uses of these methods in Nomad pass either 1 or the len(...) of something, which is always positive.
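For reference, a minimal standalone snippet (not Nomad code; the metric name is made up) that triggers the same panic message by passing a negative value to a Prometheus counter:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Counters in client_golang are strictly monotonic; Add panics if the
	// delta is negative, which produces the "counter cannot decrease in
	// value" message seen in the Nomad logs.
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total", // hypothetical metric name
		Help: "Example counter used to demonstrate the panic.",
	})
	c.Add(1)  // fine
	c.Add(-1) // panic: counter cannot decrease in value
}
```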
So in order to understand what's going on with your cluster I built custom binaries that output more information when the panic occurs. You can find them at the bottom of this page: https://github.com/hashicorp/nomad/actions/runs/4000378647
One very important thing is that these binaries are for test only and should not be run in production, so if you could, please copy the data_dir of the agent presenting this problem somewhere else and run the custom binary pointing to the new directory.
The changes included in these binaries can be viewed here:
Another thing that could be relevant, what CPU architecture are you using?
@lgfa29 I'm not sure how soon, or even whether, we can run that in the short term. As far as I know, we have unfortunately only had this issue occur once so far, in our production cluster.
Upgrading the Nomad version even in our dev environment is not that easy, because we are suffering from another Nomad bug where Nomad agent restarts cause jobs using CSI volumes to lose allocations (https://github.com/hashicorp/nomad/issues/13028). Therefore we cannot do in-place upgrades that easily.
Even if we could put your modified version in place easily, it might take a long time for this bug to re-occur.
I was planning to upgrade to 1.3.8 first to hopefully get rid of some of these bugs associated with CSI volumes.
Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with just telemetry or monitoring metrics, especially in production environments.
Wouldn't catching that panic be a relatively easy fix? I will see about running that modified version, but I'm not sure if I can do that anytime soon.
Another thing that could be relevant, what CPU architecture are you using?
We are running x86_64 or "amd64" version of Debian.
$ uname -a
Linux xxx 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux
$ file nomad
nomad: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8aa832ffba2495157c039cc24548a9738520f98e, for GNU/Linux 3.2.0, not stripped
Ah no worries, I thought this was something that was always happening when your agent started, so it could be tested quickly.
Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with just telemetry or monitoring metrics, especially in production environments.
Wouldn't catching that panic be a relatively easy fix?
I definitely agree with this, but it's not so simple to fix this from the Nomad side. The telemetry library we use is called in several places and, as far as I can tell, we always send values that are supposed to be positive.
I will have to discuss with the rest of the team how to best handle this. It would probably require changes to the go-metrics module.
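Purely as an illustration of the kind of defensive change being discussed, here is a minimal sketch of a guard in front of the go-metrics calls. safeIncrCounter is a hypothetical helper, not an existing Nomad or go-metrics function, and this is not necessarily what any actual fix does; the import path shown is the hashicorp fork of go-metrics referenced later in this thread:

```go
package metricsutil

import (
	"log"

	metrics "github.com/hashicorp/go-metrics"
)

// safeIncrCounter refuses to forward negative values to the underlying
// sink, logging the offending key instead of letting the Prometheus
// client panic with "counter cannot decrease in value".
func safeIncrCounter(key []string, val float32) {
	if val < 0 {
		log.Printf("[WARN] metrics: dropping negative counter increment for %v: %v", key, val)
		return
	}
	metrics.IncrCounter(key, val)
}
```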
Ah no worries, I thought this was something that was always happening when your agent started, so it could be tested quickly.
Ah, perhaps that wasn't clear from my report. It happened after running Nomad for a long time, and on only one machine in the cluster so far.
Thanks so much for looking into this @lgfa29
Got it, thanks! Yeah, we've been using this library for a while and haven't seen any reports about this before, and I haven't heard anything from other teams that also use it. So it seems like it was an unfortunate, but very rare, situation.
This happened again last night @lgfa29
Different machine, same Nomad version.
Mar 2 23:35:58 nomadclient002 nomad[8008]: panic: counter cannot decrease in value
Mar 2 23:35:58 nomadclient002 nomad[8008]: goroutine 8369928 [running]:
Mar 2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Mar 2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/client_golang@v1.12.0/prometheus/counter.go:109
Mar 2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000058180, 0xc001c11f60)
Mar 2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/client_golang@v1.12.0/prometheus/go_collector_go117.go:147 +0x5ec
Mar 2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Mar 2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x102
Mar 2 23:35:58 nomadclient002 nomad[8008]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Mar 2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:538 +0xb4d
Mar 2 23:35:58 nomadclient002 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 2 23:35:58 nomadclient002 systemd[1]: nomad.service: Failed with result 'exit-code'.
It occurs to me that any stack trace isn't going to be that useful because it's getting messages over a channel, so this is just the Prometheus goroutine. I wonder if we're starting from the wrong position here. If we know we aren't trying to decrement a counter, maybe something else is? The Prometheus client gets metrics from the Go runtime. I don't see anything obvious in go_collector.go or in the golang issues for MemStats, but it'd be worth some investigation here, I think.
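For background, the collector in the stack trace (go_collector_go117.go) derives its values from the Go runtime/metrics package. Here is a small standalone sketch, unrelated to Nomad itself, of how the cumulative runtime samples (the ones that end up as Prometheus counters) can be dumped for inspection:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Enumerate all runtime metrics and read a sample of each.
	descs := metrics.All()
	samples := make([]metrics.Sample, len(descs))
	for i, d := range descs {
		samples[i].Name = d.Name
	}
	metrics.Read(samples)

	// Only cumulative metrics are expected to be monotonically
	// non-decreasing; these are the candidates for Prometheus counters.
	for i, d := range descs {
		if !d.Cumulative {
			continue
		}
		s := samples[i]
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d (cumulative)\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64:
			fmt.Printf("%s = %f (cumulative)\n", s.Name, s.Value.Float64())
		}
	}
}
```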
Ah that's right, we emit metrics for the runtime environment, and probably other things I'm not remembering right now.
I think the best option we have is to move forward with https://github.com/hashicorp/go-metrics/pull/146, which would prevent crashes and also give us more information about which metric is behaving unexpectedly.
@tgross @lgfa29 Can we please get that metrics PR merged and that into a Nomad release?
This is still happening for me on occasion, and in production environments it's not great, especially in combination with that CSI volume bug: whenever Nomad dies from this panic, all of our jobs on that node that use CSI volumes also get killed and rescheduled.
So a fix for at least this panic would already help us, regardless of the CSI volume bug, which seems more difficult to tackle. The metrics PR seems like an easier win.
@lhossack has reported seeing the logging we added to detect this case in https://github.com/hashicorp/nomad/issues/18804, so that's a good start on next steps for this issue.
Fix in https://github.com/hashicorp/nomad/pull/18835. See my comment here https://github.com/hashicorp/nomad/issues/18804#issuecomment-1775331491 for a detailed breakdown of the problem.
I think it is related to https://github.com/prometheus/client_golang/pull/969
Hi @chenk008 👋
Have you encountered this issue after https://github.com/hashicorp/nomad/pull/18835 was released in Nomad 1.6.3?
@lgfa29 Sorry to disturb you. I have the same error in my project with client_golang 1.12.0. After searching, this was the only result I found, so I left a comment. The goroutine stack is in client_golang, so it is a bug in client_golang 1.12.0.
No worries @chenk008, I'm mostly trying to understand the exact issue you've experienced.
Would you be able to open a new issue and post the exact error from log output and Nomad version you're using?
Thanks!
Nomad version
Operating system and Environment details
Issue
The Nomad agent process died with a message:
panic: counter cannot decrease in value
See full panic output below.
Reproduction steps
Unknown
Expected Result
Nomad stays running
Actual Result
Nomad process exits
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Nomad config