hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Services not unregistered #16616

Closed dani closed 4 months ago

dani commented 1 year ago

I just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded. As a result, old service versions appear as failed, e.g.

image

Environment :

I haven't yet found a pattern that reproduces it 100% of the time.

tgross commented 4 months ago

> Another way to repro is to do `systemctl restart nomad` on client(s) - When I run this, service templates are getting messed up.

@blmhemu can you clarify this? What does "messed up" mean here? ~(Note that https://github.com/hashicorp/nomad/issues/19542 is unrelated, so if the API output is fine, you should take that over to #19542 and that'll get resolved over there.)~ Never mind, I see what you mean in https://github.com/hashicorp/nomad/issues/18203.

tgross commented 4 months ago

Thanks @linuxoid69! So that looks like it's the Consul agent token configuration. The allowed configuration for that changed a while back, so that reinforces what was described in https://github.com/hashicorp/nomad/issues/16616#issuecomment-1494368489.

tgross commented 4 months ago

Hi folks, just a quick update... I still don't have a reproduction for this issue. But by revisiting the code, I see a few places where it's possible to drop deregistrations and places where we could be ensuring data integrity between allocations and their service registrations.

Without a reliable reproduction that covers everyone's reports I can't guarantee that fixing the above problems will close out this issue for Nomad services forever. But these are at least plausibly involved. I'm going to work up a patch or patch series for these, and should have those up for review in the next few days assuming all goes well.
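As an illustration of the kind of integrity check described above (not the code from the actual patches), here is a minimal Python sketch that flags registrations whose allocation is no longer live. It assumes that service IDs registered by Nomad embed the allocation ID in the form `_nomad-task-<alloc-id>-...`; the function name `find_orphaned_services` and the sample IDs are hypothetical.

```python
import re

# Assumption: service IDs created by Nomad start with "_nomad-task-" followed
# by the allocation UUID, e.g. "_nomad-task-<alloc-id>-<task>-<service>-<port>".
_ALLOC_ID = re.compile(
    r"^_nomad-task-"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})-"
)


def find_orphaned_services(service_ids, live_alloc_ids):
    """Return the service IDs whose allocation is not in live_alloc_ids.

    IDs that do not match the assumed Nomad naming scheme are skipped,
    since they were presumably registered by something else.
    """
    live = set(live_alloc_ids)
    orphans = []
    for service_id in service_ids:
        match = _ALLOC_ID.match(service_id)
        if match is None:
            continue  # not a Nomad-managed registration; leave it alone
        if match.group(1) not in live:
            orphans.append(service_id)
    return orphans


# Hypothetical sample data: one stale registration, one live, one unrelated.
registered = [
    "_nomad-task-11111111-2222-3333-4444-555555555555-web-web-http",
    "_nomad-task-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee-web-web-http",
    "consul",
]
live_allocs = {"aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"}

print(find_orphaned_services(registered, live_allocs))
```

In practice, the two inputs would come from the service catalog and from the list of running allocations; how to fetch them is left out here on purpose.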

Thanks again everyone for your patience with this issue!

tgross commented 4 months ago

I've broken out the data integrity fixes to https://github.com/hashicorp/nomad/pull/20590, and I'll do the client-side work as a separate PR.

tgross commented 4 months ago

Client-side work is up in https://github.com/hashicorp/nomad/pull/20596

tgross commented 4 months ago

Ok folks, #20596 and #20590 have both been merged and will ship in the upcoming Nomad 1.8.0 (and supported backport versions). I'm going to close this issue.

"But I saw it again!"

If you see services left behind after an allocation has stopped from Nomad 1.8.0 or beyond, please let us know. We may move reports to a new issue in order to properly triage. Make sure to include the following:

natemollica-nm commented 3 months ago

@tgross It looks like we have an indication of this again in Nomad v1.8.1.

I'll work on getting a reproduction on the side to deliver if needed; just let me know in Slack. I wanted to post here to keep track.

ngcmac commented 3 months ago

Hi @tgross,

We also still have this issue in our CI environment (1 Nomad server + Consul + Vault + 3 clients). We started to observe an increase in it after we enabled Workload Identity (WI) for Consul.

Thanks

tgross commented 2 months ago

@ngcmac I'm going to continue the investigation for the case where we're getting "ACL not found" with Workload Identities in https://github.com/hashicorp/nomad/issues/23494. @natemollica-nm is going to do some follow-up on his report and we may or may not adjust #23494 to cover that case as well depending on the outcome.