Closed kevinschoonover closed 3 months ago
Hi @kevinschoonover, could you clarify which version of Consul you are running please? It does seem to match with the Consul issue.
"the agent's consul ACL token which will be deleted and as a result cause the service check to fail to deregister"
To the best of my knowledge this is correct.
"if nomad is using the catalog deregister API"
Nomad utilises the agent API for service registration and deregistration.
The team do wish to make changes to the current Nomad Consul integration process, so I will add this to the backlog board as something to look into if/when we get around to this.
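For readers following the agent API versus catalog API distinction above, here is a minimal sketch of what the two deregistration requests look like against Consul's HTTP API. It only builds the requests and does not send them. The agent address, service ID, node name, and token values are placeholders chosen for illustration, not values from this issue, and this is not a description of how Nomad issues these calls internally.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Consul agent on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def agent_deregister_request(service_id, token):
    """Build (but do not send) a deregistration request for the agent API.

    The agent API removes the service from the local agent, which then
    syncs the removal to the catalog.
    """
    quoted_id = urllib.parse.quote(service_id, safe="")
    return urllib.request.Request(
        url=f"{CONSUL_ADDR}/v1/agent/service/deregister/{quoted_id}",
        method="PUT",
        headers={"X-Consul-Token": token},
    )


def catalog_deregister_request(node, service_id, token):
    """Build (but do not send) a deregistration request for the catalog API.

    The catalog API removes the entry directly from the servers' catalog.
    """
    body = json.dumps({"Node": node, "ServiceID": service_id}).encode()
    return urllib.request.Request(
        url=f"{CONSUL_ADDR}/v1/catalog/deregister",
        data=body,
        method="PUT",
        headers={"X-Consul-Token": token, "Content-Type": "application/json"},
    )
```

Passing either request to `urllib.request.urlopen` would send it. Note that an entry removed only through the catalog API can be re-added by the local agent's anti-entropy sync if that agent still holds the service registration.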
@jrasell I updated the issue with the version
consul version
Consul v1.12.2
Revision 19041f20
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
I am happy to contribute a fix if possible, I'm just not sure the design y'all would want. The two designs I can think of off the top of my head are:
and am not really happy with either and I don't yet want to give up on rotating nomad's consul tokens. Not sure what kind of bandwidth is available to figure out a potential solution.
I think I'm running into this same issue, any progress to report @jrasell? It also seems that ServiceIDs that are registered by nomad can't be deregistered from consul using the agent API.
Just adding a note that this issue seems to be the same one described here https://github.com/hashicorp/nomad/issues/9813
We made this change in Consul https://github.com/hashicorp/consul/pull/16097, which should fix the case where the agent is unable to deregister a service or check because the service token was deleted (meaning the token from the token field within the Consul service or check definition).
I haven't looked closely at the Nomad-Consul integration, but with this change, as long as the Consul agent has an agent token with node:write permission for its Consul node, the service/check deregistrations should succeed (meaning the acl.tokens.agent field is set on the client agent, or set with the consul acl set-agent-token agent command).
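As a rough illustration of the setup described above, the sketch below renders ACL rules that grant node:write on a single Consul node and assembles the `consul acl set-agent-token agent` command line. The node name and token are hypothetical placeholders, and the helper names are made up for this example. Creating the policy and a token attached to it is left out.

```python
def node_write_policy(node_name):
    """Render ACL rules granting write access to one Consul node.

    node_name is a placeholder; use the agent's actual node name.
    """
    return f'node "{node_name}" {{\n  policy = "write"\n}}\n'


def set_agent_token_command(token):
    """Return the argv list that sets the agent token on a running agent.

    token is the secret ID of a token that carries the policy above.
    """
    return ["consul", "acl", "set-agent-token", "agent", token]


if __name__ == "__main__":
    # Hypothetical node name, for illustration only.
    print(node_write_policy("nomad-client-1"))
    print(" ".join(set_agent_token_command("<agent-token-secret-id>")))
```

The same token can instead be set statically through the acl.tokens.agent field in the agent configuration, as the comment above notes.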
Doing a little issue cleanup. Closing as effectively a dupe of https://github.com/hashicorp/nomad/issues/20185
Nomad version
Output from nomad version
consul version
Operating system and Environment details
debian 11
Issue
In my environment, I rotate nomad's consul ACL token every 7 days using vault-agent. Recently, we have noticed issues where consul services can't deregister when the nomad job moves to a new node. The below image shows the service running on prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290 and prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-64a9t0, but it's only actually running on prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290 according to the nomad allocation.
I think we're hitting https://github.com/hashicorp/consul/issues/9577, where consul fails to deregister the service even though it's no longer running on the node because the ACL token used to register the service is deleted, but I am having a hard time reproducing it locally. Looking at the consul log from a different repro, I see it sending a similar message with an empty accessorID (which I'm assuming is because the ACL token was since deleted).
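One way to spot this symptom without relying on the UI is to compare the catalog entries for a service against the nodes where Nomad says the allocation is running. The sketch below assumes entries shaped like the response of Consul's `/v1/catalog/service/<name>` endpoint (using its `Node` and `ServiceID` fields). The sample service IDs are invented for illustration, and `stale_instances` is a hypothetical helper, not part of Nomad or Consul.

```python
def stale_instances(catalog_entries, expected_nodes):
    """Return (node, service_id) pairs registered on unexpected nodes.

    catalog_entries: list of dicts with "Node" and "ServiceID" keys.
    expected_nodes: set of node names where the job is actually running.
    """
    return [
        (entry["Node"], entry["ServiceID"])
        for entry in catalog_entries
        if entry["Node"] not in expected_nodes
    ]


if __name__ == "__main__":
    # Node names are from the report; service IDs are made up.
    entries = [
        {"Node": "prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290",
         "ServiceID": "example-service-1"},
        {"Node": "prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-64a9t0",
         "ServiceID": "example-service-2"},
    ]
    running_on = {"prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290"}
    for node, service_id in stale_instances(entries, running_on):
        print(f"stale registration: {service_id} on {node}")
```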
Someone mentioned in the issues a couple of ways to fix it (https://github.com/hashicorp/consul/issues/9577#issuecomment-771203024); however, I don't think this currently works for nomad as
Please let me know what information I can help provide to make sure this is the issue.
Reproduction steps
I created a setup with nomad agent -dev and consul agent -dev, but when I refresh the token it deregisters the service properly. I am assuming this is because the consul agent is running as both server and client, so it doesn't have ACL issues.
repro.tar.gz
I will follow up when I figure out a good way to set up a docker consul server and client to test it more thoroughly.
Job file (if appropriate)
See attached .tar.gz