hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

consul services registered before ACL token regeneration become orphaned #13537

Closed kevinschoonover closed 3 months ago

kevinschoonover commented 2 years ago

Nomad version

Output from nomad version

nomad version
Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

consul version

consul version
Consul v1.12.2
Revision 19041f20
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Operating system and Environment details

debian 11

Issue

In my environment, I rotate nomad's consul ACL token every 7 days using vault-agent. Recently, we have noticed issues where consul services can't deregister when the nomad job moves to a new node.

The image below shows the service registered on both prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290 and prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-64a9t0:

[image]

but it's only actually running on prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290 according to the nomad allocation.
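One way to spot this kind of mismatch without the UI is to compare the catalog entries for a service against the nodes where Nomad reports an allocation. The sketch below is only illustrative: it assumes the entries come from Consul's `GET /v1/catalog/service/<name>` response (which includes `Node` and `ServiceID` fields), and that the list of expected nodes is supplied by hand from the Nomad allocation. The helper name `stale_entries` is hypothetical.

```python
def stale_entries(catalog_entries, expected_nodes):
    """Return (node, service_id) pairs registered on nodes that are not expected.

    catalog_entries: list of dicts shaped like items of the
        GET /v1/catalog/service/<name> response (only Node and ServiceID are read).
    expected_nodes: node names where Nomad says the allocation is running.
    """
    expected = set(expected_nodes)
    return [
        (entry["Node"], entry["ServiceID"])
        for entry in catalog_entries
        if entry["Node"] not in expected
    ]


if __name__ == "__main__":
    # Hand-written sample data mirroring the situation described above.
    entries = [
        {"Node": "prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290", "ServiceID": "svc-a"},
        {"Node": "prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-64a9t0", "ServiceID": "svc-b"},
    ]
    running_on = ["prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-b290"]
    for node, service_id in stale_entries(entries, running_on):
        print(f"possibly orphaned: {service_id} on {node}")
```

The service IDs in the sample data are placeholders, not values taken from the affected cluster.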

I think we're hitting https://github.com/hashicorp/consul/issues/9577, where consul fails to deregister the service even though it's no longer running on the node, because the ACL token used to register the service has been deleted. I am having a hard time reproducing it locally, though. Looking at the consul log from a different repro, I see it logging similar messages with an empty accessorID (which I'm assuming is because the ACL token has since been deleted):

Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Service registration blocked by ACLs: service=_nomad-task-5d75ba91-bb59-b1a4-b210-817d7967981f-group-otel-collector-loki-otel-collector-http_zpages accessorID=
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent.client: RPC failed to server: method=Catalog.Register server=10.124.0.7:8300 error="rpc error making call: ACL not found"
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Service registration blocked by ACLs: service=_nomad-task-5d75ba91-bb59-b1a4-b210-817d7967981f-group-otel-collector-loki-otel-collector-jaeger_thrift_compact accessorID=
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent.client: RPC failed to server: method=Catalog.Register server=10.124.0.2:8300 error="rpc error making call: rpc error making call: ACL not found"
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Check registration blocked by ACLs: check=_nomad-check-860bc4eb85272d5ce538700823cfa3adc1798c8d accessorID=
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent.client: RPC failed to server: method=Catalog.Register server=10.124.0.5:8300 error="rpc error making call: rpc error making call: ACL not found"
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Check registration blocked by ACLs: check=_nomad-check-5e05156e6ebfff1f8e20c431b41542718dfa9a71 accessorID=
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent.client: RPC failed to server: method=Catalog.Register server=10.124.0.7:8300 error="rpc error making call: ACL not found"
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Check registration blocked by ACLs: check=_nomad-check-7d62e2e0ef1077d19c3f8e878b26683e6ef9c3ad accessorID=
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent.client: RPC failed to server: method=Catalog.Register server=10.124.0.2:8300 error="rpc error making call: rpc error making call: ACL not found"
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Check registration blocked by ACLs: check=_nomad-check-dd5b8de7600a412988d480fef02493ad11e24f54 accessorID=
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent.client: RPC failed to server: method=Catalog.Register server=10.124.0.5:8300 error="rpc error making call: rpc error making call: ACL not found"
Jun 23 02:18:59 prod-do-sfo3-s-2vcpu-4gb-amd-nomad-client-614e consul[1450005]: agent: Check registration blocked by ACLs: check=_nomad-check-09d84ecf4ceaf576f684607532fbe93ff1b0bd78 accessorID=
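To collect the affected service and check IDs from a log like the one above, a small parser can help. This is a minimal sketch that assumes the log lines have exactly the shape shown here ("... registration blocked by ACLs: service=<id> accessorID=<value>"); the function name `blocked_registrations` is hypothetical, and other Consul versions may format these messages differently.

```python
import re

# Matches agent log lines such as:
#   agent: Service registration blocked by ACLs: service=<id> accessorID=
#   agent: Check registration blocked by ACLs: check=<id> accessorID=
BLOCKED = re.compile(
    r"agent: (?P<kind>Service|Check) registration blocked by ACLs: "
    r"(?:service|check)=(?P<id>\S+) accessorID=(?P<accessor>\S*)"
)


def blocked_registrations(lines):
    """Return (kind, id) pairs for blocked registrations with an empty accessorID."""
    found = []
    for line in lines:
        match = BLOCKED.search(line)
        # An empty accessorID is the symptom described in this issue.
        if match and match.group("accessor") == "":
            found.append((match.group("kind").lower(), match.group("id")))
    return found


if __name__ == "__main__":
    import sys

    # Usage: journalctl -u consul | python blocked.py
    for kind, ident in blocked_registrations(sys.stdin):
        print(kind, ident)
```

Lines that do not match the pattern, such as the "RPC failed to server" messages, are ignored.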

Someone mentioned in the issues a couple of ways to fix it (https://github.com/hashicorp/consul/issues/9577#issuecomment-771203024); however, I don't think this currently works for nomad as

  1. I'm assuming it uses the agent's consul ACL token which will be deleted and as a result cause the service check to fail to deregister
  2. I don't know if nomad is using the catalog deregister API

Please let me know what information I can help provide to make sure this is the issue.

Reproduction steps

I created a setup with a nomad agent -dev and consul agent -dev, but when I refresh the token it deregisters the service properly. I am assuming this is because the consul agent is running as both server and client so it doesn't have ACL issues.

repro.tar.gz

I will follow up when I figure out a good way to set up a docker consul server and client to test it more thoroughly.

Job file (if appropriate)

See attached .tar.gz

jrasell commented 2 years ago

Hi @kevinschoonover, could you clarify which version of Consul you are running please? It does seem to match with the Consul issue.

> the agent's consul ACL token which will be deleted and as a result cause the service check to fail to deregister

To the best of my knowledge this is correct.

> if nomad is using the catalog deregister API

Nomad utilises the agent API for service registration and deregistration.
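For readers unfamiliar with the distinction, the two Consul HTTP endpoints differ as follows: the agent API deregisters a service from a specific local agent (`PUT /v1/agent/service/deregister/<service_id>`), while the catalog API removes an entry directly from the server catalog (`PUT /v1/catalog/deregister` with a JSON body naming the node and service). The sketch below only builds the requests and does not send them; the helper names, addresses, and token handling are illustrative assumptions and say nothing about how Nomad itself is implemented.

```python
import json
import urllib.parse
import urllib.request


def agent_deregister_request(agent_addr, service_id, token):
    """Build PUT /v1/agent/service/deregister/<service_id> for a given agent."""
    quoted = urllib.parse.quote(service_id, safe="")
    return urllib.request.Request(
        f"{agent_addr}/v1/agent/service/deregister/{quoted}",
        method="PUT",
        headers={"X-Consul-Token": token},
    )


def catalog_deregister_request(server_addr, node, service_id, token):
    """Build PUT /v1/catalog/deregister for one service on one node."""
    body = json.dumps({"Node": node, "ServiceID": service_id}).encode()
    return urllib.request.Request(
        f"{server_addr}/v1/catalog/deregister",
        data=body,
        method="PUT",
        headers={"X-Consul-Token": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder address, node name, service ID, and token.
    req = catalog_deregister_request(
        "http://127.0.0.1:8500", "example-node", "example-service-id", "example-token"
    )
    print(req.get_method(), req.full_url, req.data.decode())
```

Passing either request to `urllib.request.urlopen` would send it. Note that a catalog deregistration may not stick if an agent still holds the service locally, since the agent's anti-entropy sync can write it back to the catalog.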

The team do wish to make changes to the current Nomad Consul integration process, so I will add this to the backlog board as something to look into if/when we get around to this.

kevinschoonover commented 2 years ago

@jrasell I updated the issue with the version

consul version
Consul v1.12.2
Revision 19041f20
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

I am happy to contribute a fix if possible, I'm just not sure the design y'all would want. The two designs I can think of off the top of my head are:

  1. update all consul service registrations when nomad is reloaded, which preserves the current API but is potentially very expensive
  2. use the catalog deregister API instead of the agent API, which may cause weird downstream effects

I'm not really happy with either, and I don't yet want to give up on rotating nomad's consul tokens. I'm not sure what kind of bandwidth is available to figure out a potential solution.

the-maldridge commented 1 year ago

I think I'm running into this same issue, any progress to report @jrasell? It also seems that ServiceIDs that are registered by nomad can't be deregistered from consul using the agent API.

ryanm-sq commented 1 year ago

Just adding a note that this issue seems to be the same one described here https://github.com/hashicorp/nomad/issues/9813

pglass commented 1 year ago

We made this change in Consul https://github.com/hashicorp/consul/pull/16097, which should fix the case where the agent is unable to deregister a service or check because the service token was deleted (meaning the token from the token field within the Consul service or check definition).

I haven't looked closely at the Nomad-Consul integration, but with this change, as long as the Consul agent has an agent token with node:write permission for its Consul node, the service/check deregistrations should succeed (meaning the acl.tokens.agent field is set on the client agent, or the token is set with the consul acl set-agent-token agent command).
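For reference, the agent token can also be set over Consul's HTTP API via `PUT /v1/agent/token/agent` with a JSON body containing the token. The sketch below only builds that request; the helper name, the address, and both token values are placeholders, and the caller's token is assumed to have permission to update agent tokens.

```python
import json
import urllib.request


def set_agent_token_request(agent_addr, new_agent_token, caller_token):
    """Build PUT /v1/agent/token/agent, which updates the agent token at runtime."""
    body = json.dumps({"Token": new_agent_token}).encode()
    return urllib.request.Request(
        f"{agent_addr}/v1/agent/token/agent",
        data=body,
        method="PUT",
        headers={"X-Consul-Token": caller_token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent until the request is passed to urlopen.
    req = set_agent_token_request(
        "http://127.0.0.1:8500", "example-agent-token", "example-caller-token"
    )
    print(req.get_method(), req.full_url)
```

A token set this way is only kept across agent restarts if token persistence (`acl.enable_token_persistence`) is enabled; otherwise setting `acl.tokens.agent` in the agent configuration is the durable option.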

tgross commented 3 months ago

Doing a little issue cleanup. Closing as effectively a dupe of https://github.com/hashicorp/nomad/issues/20185