hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad process dies with `panic: counter cannot decrease in value` #15861

Closed rbastiaans-tc closed 1 year ago

rbastiaans-tc commented 1 year ago

Nomad version

Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

Operating system and Environment details

Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:    10
Codename:   buster

Issue

The Nomad agent process died with a message: panic: counter cannot decrease in value

See full panic output below.

Reproduction steps

Unknown

Expected Result

Nomad stays running

Actual Result

Nomad process exits

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Sun 2023-01-22 16:57:52 UTC nomadclient013 nomad[3403]: ==> Newer Nomad version available: 1.4.3 (currently running: 1.3.1)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: panic: counter cannot decrease in value
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: goroutine 7598137 [running]:
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/counter.go:109
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000280080, 0xc00235bf60)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/go_collector_go117.go:147 +0x5ec
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x102
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:457 +0x4e8
Sun 2023-01-22 16:58:41 UTC nomadclient013 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sun 2023-01-22 16:58:41 UTC nomadclient013 systemd[1]: nomad.service: Failed with result 'exit-code'.
Sun 2023-01-22 16:58:43 UTC nomadclient013 systemd[1]: nomad.service: Service RestartSec=2s expired, scheduling restart.
Sun 2023-01-22 16:58:43 UTC nomadclient013 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 1.

Nomad config

data_dir = "/var/lib/nomad"

client {
    enabled = true
    options {
        docker.auth.config = "/etc/docker/dockercfg.json"
    }
    meta {
        az = "az-1"
    }
}

vault {
    enabled = true
    tls_skip_verify = true
    address = "https://xxxx"
    namespace = "xxxx"
}

telemetry {
    publish_allocation_metrics = true
    publish_node_metrics = true
    prometheus_metrics = true
}

consul {
    tags = ["no-http"]
}
lgfa29 commented 1 year ago

Hi @rbastiaans-tc šŸ‘‹

Thanks for the report. Unfortunately we will need more information in order to understand what's going wrong.

Looking at the code, this panic is raised by the Prometheus client library when a counter is asked to add a negative value.

This library is used by go-metrics in the IncrCounter and IncrCounterWithLabels methods, but all uses of these methods in Nomad pass either 1 or the len(...) of something, which is always positive.
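
To make that concrete, here is a minimal standalone sketch (illustrative only, not Nomad code; the metric name is made up) of the behaviour described above: in client_golang, a counter's Add method panics with exactly this message as soon as it is handed a negative value.

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
    c := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "example_events_total", // hypothetical metric name for illustration
        Help: "Illustrative counter.",
    })

    c.Add(1)  // fine: counters may only ever increase
    c.Add(-1) // panics: "counter cannot decrease in value"
}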

So in order to understand what's going on with your cluster I built custom binaries that output more information when the panic occurs. You can find them at the bottom of this page: https://github.com/hashicorp/nomad/actions/runs/4000378647

One very important thing: these binaries are for testing only and should not be run in production. So, if you can, please copy the data_dir of the agent presenting this problem somewhere else and run the custom binary pointing at the new directory.

The changes included in these binaries can be viewed here:

Another thing that could be relevant: what CPU architecture are you using?

rbastiaans-tc commented 1 year ago

@lgfa29 I'm not sure how soon, or even whether, we can run that in the short term. As far as I know, we have only had this issue occur once so far, in our production cluster.

Upgrading the Nomad version, even in our dev environment, is not that easy, because we are suffering from another Nomad bug where Nomad agent restarts cause jobs using CSI volumes to lose their allocations: https://github.com/hashicorp/nomad/issues/13028. Therefore we cannot do in-place upgrades that easily.

Even if we could put your modified version in place easily, it might take a long time for this bug to recur.

I was planning to upgrade to 1.3.8 first to hopefully get rid of some of these bugs associated with CSI volumes.

Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with just telemetry or monitoring metrics, especially in production environments.

Wouldn't catching that panic be a relatively easy fix? I will see about running that modified version, but I'm not sure if I can do that anytime soon.

> Another thing that could be relevant: what CPU architecture are you using?

We are running the x86_64 ("amd64") version of Debian.

$ uname -a
Linux xxx 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux

$ file nomad
nomad: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8aa832ffba2495157c039cc24548a9738520f98e, for GNU/Linux 3.2.0, not stripped
lgfa29 commented 1 year ago

Ah no worries, I thought this was something that always happened when your agent started, so it could be tested quickly.

> Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with just telemetry or monitoring metrics, especially in production environments.
>
> Wouldn't catching that panic be a relatively easy fix?

I definitely agree with this, but it's not so simple to fix this from the Nomad side. The telemetry library we use is called in several places and, as far as I can tell, we always send values that are supposed to be positive.

I will have to discuss with the rest of the team how to best handle this. It would probably require changes to the go-metrics module.
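
One possible shape of such a change, sketched below purely as an illustration (this is not the actual go-metrics PR, and it assumes the MetricSink interface from the hashicorp/go-metrics module): a wrapper sink that drops and logs negative counter increments instead of forwarding them to the Prometheus sink, which would otherwise panic.

package metricsguard

import (
    "log"

    "github.com/hashicorp/go-metrics"
)

// guardedSink wraps another go-metrics sink and refuses to forward
// negative counter increments to it.
type guardedSink struct {
    metrics.MetricSink
}

func (s guardedSink) IncrCounter(key []string, val float32) {
    if val < 0 {
        log.Printf("dropping negative counter increment %v for %v", val, key)
        return
    }
    s.MetricSink.IncrCounter(key, val)
}

func (s guardedSink) IncrCounterWithLabels(key []string, val float32, labels []metrics.Label) {
    if val < 0 {
        log.Printf("dropping negative counter increment %v for %v", val, key)
        return
    }
    s.MetricSink.IncrCounterWithLabels(key, val, labels)
}

Wrapping the real Prometheus sink in guardedSink{realSink} anywhere a MetricSink is expected would turn a crashing panic into a log line that names the offending metric.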

rbastiaans-tc commented 1 year ago

> Ah no worries, I thought this was something that always happened when your agent started, so it could be tested quickly.

Ah perhaps that wasn't clear from my report. It happened after running Nomad for a long time, on only 1 machine in the cluster so far.

Thanks so much for looking into this @lgfa29

lgfa29 commented 1 year ago

Got it, thanks! Yeah, we've been using this library for a while and haven't seen any reports about this before. I also haven't heard about it from other teams that use it. So it seems like it was an unfortunate, but very rare, situation.

rbastiaans-tc commented 1 year ago

This happened again last night @lgfa29

Different machine, same Nomad version.

Mar  2 23:35:58 nomadclient002 nomad[8008]: panic: counter cannot decrease in value
Mar  2 23:35:58 nomadclient002 nomad[8008]: goroutine 8369928 [running]:
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/counter.go:109
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000058180, 0xc001c11f60)
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/go_collector_go117.go:147 +0x5ec
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x102
Mar  2 23:35:58 nomadclient002 nomad[8008]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:538 +0xb4d
Mar  2 23:35:58 nomadclient002 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar  2 23:35:58 nomadclient002 systemd[1]: nomad.service: Failed with result 'exit-code'.
tgross commented 1 year ago

It occurs to me that any stack trace isn't going to be that useful because it's getting messages over a channel, so this is just the Prometheus goroutine. I wonder if we're starting from the wrong position here. If we know we aren't trying to decrement a counter, maybe something else is? The Prometheus client gets metrics from the Go runtime. I don't see anything obvious in go_collector.go or in the golang issues for MemStats, but I think it'd be worth some investigation here.
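
For reference, a small sketch of the path being described (illustrative only; Nomad wires this up through go-metrics rather than directly like this): the Go runtime collector lives in the Prometheus registry, and every scrape runs its runtime-derived counters through the same counter code that panicked in the traces above.

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/collectors"
)

func main() {
    reg := prometheus.NewRegistry()

    // The Go collector exposes runtime data (GC, MemStats, goroutines).
    // In the stack traces above it is this collector's Collect method
    // that ends up calling counter.Add with a negative delta.
    reg.MustRegister(collectors.NewGoCollector())

    // Gather is what runs on every Prometheus scrape; the panics in the
    // logs were raised from goroutines started by Gather.
    mfs, err := reg.Gather()
    if err != nil {
        panic(err)
    }
    for _, mf := range mfs {
        fmt.Println(mf.GetName())
    }
}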

lgfa29 commented 1 year ago

Ah that's right, we emit metrics for the runtime environment, and probably other things I'm not remembering right now šŸ˜…

I think the best option we have is to move forward with https://github.com/hashicorp/go-metrics/pull/146, that would prevent crashes and also give us more information about which metric is behaving unexpectedly.

rbastiaans-tc commented 1 year ago

@tgross @lgfa29 Can we please get that metrics PR merged and included in a Nomad release?

This is still happening for me on occasion, and in production environments it's not really great, especially in combination with that CSI volume bug. It means that whenever Nomad dies from this panic, all of our jobs running there that use CSI volumes also get killed and rescheduled.

So if we could get a fix at least for this panic, it would already help us, regardless of the CSI volume bug, which seems more difficult to tackle. The metrics PR seems like an easier win.

tgross commented 1 year ago

@lhossack has reported seeing the logging we added to detect this case in https://github.com/hashicorp/nomad/issues/18804, so that's a good start on next steps for this issue.

tgross commented 1 year ago

Fix in https://github.com/hashicorp/nomad/pull/18835. See my comment here https://github.com/hashicorp/nomad/issues/18804#issuecomment-1775331491 for a detailed breakdown of the problem.

chenk008 commented 8 months ago

I think it is related to https://github.com/prometheus/client_golang/pull/969

lgfa29 commented 8 months ago

Hi @chenk008 šŸ‘‹

Have you encountered this issue after https://github.com/hashicorp/nomad/pull/18835 was released in Nomad 1.6.3?

chenk008 commented 8 months ago

@lgfa29 Sorry to disturb you. I have the same error in my own project with client_golang 1.12.0. After searching, this issue was the only result I found, so I left a comment. The goroutine stack is in client_golang; it is a bug in client_golang 1.12.0.

lgfa29 commented 8 months ago

No worries @chenk008, I'm mostly trying to understand the exact issue you've experienced.

Would you be able to open a new issue and post the exact error from your log output and the Nomad version you're using?

Thanks!