hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

ACL token rotation causes check status inconsistency #4372

Open danaps opened 6 years ago

danaps commented 6 years ago

Edit(banks)

Note that the thread converges on a root cause that may not apply to the original ACL-related issue reported here. See https://github.com/hashicorp/consul/issues/4372#issuecomment-404529391 for a summary of the issue this now represents.


Hi folks, we run Consul 1.2.0 and encountered the following issue: in the Consul UI, when looking at a node's health checks, we see one is failing.

    curl http://localhost:8500/v1/health/node/ip-10-1-7-170

    {
      "Node": "ip-10-1-7-170",
      "CheckID": "c49af2eedf67fdbf7135045051aad72c5f1d8a4c",
      "Name": "Nomad Client HTTP Check",
      "Status": "critical",
      "Notes": "",
      "Output": "",
      "ServiceID": "_nomad-client-vsvkabaz4izafp6d5ejwyhrurtl2pdeq",
      "ServiceName": "nomad-client",
      "ServiceTags": [ "http" ],
      "Definition": {},
      "CreateIndex": 170787117,
      "ModifyIndex": 170787117
    }

After checking the underlying API call /v1/agent/checks on the node, the check is actually passing:

    curl http://localhost:8500/v1/agent/checks

    "c49af2eedf67fdbf7135045051aad72c5f1d8a4c": {
      "Node": "ip-10-1-7-170",
      "CheckID": "c49af2eedf67fdbf7135045051aad72c5f1d8a4c",
      "Name": "Nomad Client HTTP Check",
      "Status": "passing",
      "Notes": "",
      "Output": "HTTP GET http://0.0.0.0:4646/v1/agent/health?type=client: 200 OK Output: {\"client\":{\"message\":\"ok\",\"ok\":true}}",
      "ServiceID": "_nomad-client-vsvkabaz4izafp6d5ejwyhrurtl2pdeq",
      "ServiceName": "nomad-client",
      "ServiceTags": [ "http" ],
      "Definition": {},
      "CreateIndex": 0,
      "ModifyIndex": 0
    }

burdandrei commented 6 years ago

Let me rephrase this: after running 1.2.0 in staging for a week, we updated our production cluster to 1.2.0 a week ago. For monitoring we use Telegraf, which sends the data to InfluxDB. After some time we noticed in Grafana that we could see fewer healthy hosts than were actually running. After debugging we found that some checks shown as critical in the UI (both the official one and Hashi-UI) are shown as passing if we check them on the agent locally with curl http://localhost:8500/v1/agent/checks. It looks like the UI is using the "consensus" endpoint to show the check statuses: curl http://localhost:8500/v1/health/node/:node. After comparing the outputs we were able to see that the same check is healthy at the agent level and unhealthy at the cluster level.

Restarting the consul client synchronises health at the cluster level to reflect the real values. @preetapan @banks this looks like a pretty major bug.
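
A quick way to spot this discrepancy on a given node (a minimal sketch; the jq filters are just for readability, and the node name is assumed to match the hostname):

    # Local agent's view of its checks
    curl -s http://localhost:8500/v1/agent/checks | jq '.[] | {CheckID, Status}'
    # Cluster (catalog) view of the same node's checks
    curl -s "http://localhost:8500/v1/health/node/$(hostname)" | jq '.[] | {CheckID, Status}'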

banks commented 6 years ago

Thanks for the report @danaps and @burdandrei. Are you positive this is a change in behaviour with 1.2.0? Do you have any way to run 1.0.7 on your staging cluster for a day or so to confirm the behaviour goes away?

I ask not to dismiss it but to try to establish whether it's a new issue - as far as I know nothing specifically changed in the code path related to agents updating statuses with the servers. The one major change is that the new UI became the default and may present the data in a way that makes something that always happened more noticeable (wild speculation, but an example of why we should clear up the link between version and bug).

Can you also provide some additional information to help us debug:

  1. How many clients/services/checks are you running?
  2. What percentage of checks are in this inconsistent state?
  3. Do checks that get into this state stay inconsistent indefinitely or "heal" themselves after 10 mins or an hour or a day?
  4. Is it all the same types of checks that are failing?
  5. Are you using ACLs with default deny? If yes are you using Vault or something else to rotate the tokens?
  6. Can you please upload the logs for your cluster leader and at least one agent that has checks in an inconsistent state? A gist works but if you don't want to share them publicly then we can provide a private email address to send them to.

Thanks!

banks commented 6 years ago

For reference for folks who aren't aware, the local agent state and global cluster state are kept in sync by a process we call "anti entropy" syncing. You can read more here: https://www.consul.io/docs/internals/anti-entropy.html#anti-entropy-1

Check status is in general only eventually consistent, which is why I asked whether it fixes itself, although I assume not since it wasn't mentioned.

The issue is likely related to the AE sync mechanism failing, which might be a bug but might also be caused by network issues, ACL token issues and several other factors, some of which are noted in the docs. How long it takes to sync can be affected by cluster size, the number of services and checks running on the local agent, available network connectivity, etc.
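
One way to watch those sync attempts on a suspect agent (a minimal sketch; the grep is just a convenience filter):

    # Stream the agent's debug log; anti-entropy failures appear as
    # "agent: failed to sync changes: ..." lines with the underlying error.
    consul monitor -log-level=debug | grep -i sync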

burdandrei commented 6 years ago

Thanks for the answer @banks

  1. We have 158 clients in one DC (112 of them running 1.2.0); the others are on various old versions (don't ask ;))
  2. I can't tell you exactly, but we only see it on nodes that we use as nomad "shared" machines, meaning they run a lot of services that are constantly starting and stopping.
  2.1 Part of the checks are TTL checks that we set from an application, and as far as I can see, even when we deregister the application's service, because the checks were added separately they stay on the client with an associated ServiceID that is not on the agent anymore /shrug
  3. No, they stay in the wrong state forever.
  4. The checks themselves are not failing; most of them are HTTP checks that are shown as passing on http://localhost:8500/v1/agent/checks but critical on http://localhost:8500/v1/health/node/:node
  5. The ACL default policy is deny. We have a couple of applications that run with dynamic consul ACLs, but acl_agent_token is static.
  6. Sure, which logs? We just rotated all the servers to be on the safe side, so the log is pretty empty right now:
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]: ==> Starting Consul agent...
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]: ==> Consul agent running!
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:            Version: 'v1.2.0'
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:            Node ID: '722ee3f5-3aec-81d2-6d08-62414e5470ff'
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:          Node name: 'ip-10-1-7-187'
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:         Datacenter: 'us-east-1' (Segment: '<all>')
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:             Server: true (Bootstrap: false)
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:       Cluster Addr: 10.1.7.187 (LAN: 8301, WAN: 8302)
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:            Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false
    Jul 11 08:54:06 ip-10-1-7-187 consul[9452]: ==> Log data will now stream in as it occurs:
    Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: 2018/07/11 08:54:07 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.152:20331
    Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: WARNING: LAN keyring exists but -encrypt given, using keyring
    Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: WARNING: WAN keyring exists but -encrypt given, using keyring
    Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: 2018/07/11 08:54:07 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.152:20333
    Jul 11 08:54:08 ip-10-1-7-187 consul[9452]: 2018/07/11 08:54:08 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.152:20337
    Jul 11 08:55:20 ip-10-1-7-187 consul[9452]:     2018/07/11 08:55:20 [ERR] memberlist: Failed fallback ping: write tcp 10.1.7.187:27295->10.1.7.247:8301: i/o timeout
    Jul 11 08:55:20 ip-10-1-7-187 consul[9452]: memberlist: Failed fallback ping: write tcp 10.1.7.187:27295->10.1.7.247:8301: i/o timeout
    Jul 11 08:59:05 ip-10-1-7-187 consul[9452]:     2018/07/11 08:59:05 [ERR] yamux: Failed to write header: write tcp 10.1.7.187:8300->10.1.3.171:37593: use of closed network connection
    Jul 11 08:59:05 ip-10-1-7-187 consul[9452]: yamux: Failed to write header: write tcp 10.1.7.187:8300->10.1.3.171:37593: use of closed network connection
    Jul 11 09:08:09 ip-10-1-7-187 consul[9452]: 2018/07/11 09:08:09 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.7.171:20149
    Jul 11 09:19:58 ip-10-1-7-187 consul[9452]: 2018/07/11 09:19:58 [ERR] raft-net: Failed to decode incoming command: read tcp 10.1.7.187:8300->10.1.1.152:20333: read: connection reset by peer
    Jul 11 09:20:06 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:06 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.7.195:6020
    Jul 11 09:20:07 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:07 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.115:22475
    Jul 11 09:20:10 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:10 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.3.170:33638
    Jul 11 09:20:10 ip-10-1-7-187 consul[9452]:     2018/07/11 09:20:10 [ERR] http: Request GET /v1/kv/vault/core/cluster/local/info, error: No cluster leader from=127.0.0.1:24123
    Jul 11 09:20:10 ip-10-1-7-187 consul[9452]: http: Request GET /v1/kv/vault/core/cluster/local/info, error: No cluster leader from=127.0.0.1:24123
    Jul 11 09:20:11 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:11 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.3.251:9630

Our production is really not happy now =(

burdandrei commented 6 years ago

and why 1.0.7? we had 1.1.0 running for a pretty long time in prod.

banks commented 6 years ago

and why 1.0.7? we had 1.1.0 running for a pretty long time in prod.

ENOTENOUGHCOFFEE :D yeah I just meant previous version.

For logs, the most useful ones would be from the agent where the checks are failing - I suspect the issue is that it's not able to sync its state with the servers, and it should log why.

Checks are not failing

All checks start in the critical state when registered, until they succeed for the first time, so it's possible the initial registration is working but then subsequent updates are not.
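
For illustration, a hedged sketch of a check registration (the check name, URL, and values are made up); the optional "Status" field controls the initial state, and without it a new check starts out critical:

    # Register an HTTP check; omitting "Status" means it starts as critical
    # until its first successful run, which is the default behaviour described above.
    curl --request PUT \
      --data '{"Name": "my-http-check", "HTTP": "http://localhost:8080/health", "Interval": "10s", "Status": "passing"}' \
      http://localhost:8500/v1/agent/check/register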

ACL default policy is Deny. we got couple of applications that are running with dynamic consul acls, but acl_agent_token is static

Interesting. I ask because anti-entropy requires that the token the application registered with remains valid; if it doesn't, anti-entropy breaks. It might not be the cause, but it would definitely show in the logs if it is. We don't do a great job of documenting this or how to work with rotating tokens currently - it's on the near-term list to rework.

If you can post client agent (with "failing" checks) logs for the period in which they appeared as failing that would be great.

Also, do the checks ever show as passing in the catalog? i.e. do they show as critical for their entire life once registered, or are they passing for some time and then flip to critical and never recover?

burdandrei commented 6 years ago

I just went over the cluster and updated everything to 1.2.0 =) Failing checks remain critical in the catalog, sometimes even after a client restart. Checking further now that the cluster is on a consistent version.

burdandrei commented 6 years ago

Just checked, and I saw that machines with an agent/server discrepancy are fixed after a couple of minutes. Will continue to monitor and update you. P.S. @banks forgot to tell you - your talk at HashiDays was awesome =)

banks commented 6 years ago

Interesting. Keep us posted.

Thanks for the compliment @burdandrei ❤️Always nice to get good feedback 😄.

burdandrei commented 6 years ago

Hi again! =) Today's news: I found a service that was deregistered from the agent. When I query the /agent/services endpoint I can't see it, and there is no check in the /agent/checks endpoint either. But I can see the node in the /catalog/service/:service-name endpoint as passing, and I can see the checks as passing in /health/checks/:service-name as well as in /health/node/:node.

After running consul monitor --log-level=debug I can see a lot of these:

2018/07/12 09:35:17 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:17 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:17 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:19 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:19 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:20 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.195:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:20 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:20 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:22 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.1.115:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:22 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:22 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:24 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:25 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:25 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:25 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:27 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.195:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:27 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:27 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.1.115:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:30 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:30 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: ACL not found
2018/07/12 09:35:30 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:32 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:32 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:32 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.195:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.1.115:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:35 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:35 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:35 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:37 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:37 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: ACL not found
2018/07/12 09:35:37 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:39 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:39 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not 

An interesting point: normally in our cluster nomad is the one that registers/deregisters the services in Consul WITHOUT a token (using the anonymous token). These services and checks were registered by resec - a redis sidecar that runs HA redis based on Consul session locks. Because it has to acquire a lock and write to KV, it runs with the following ACL:

key "resec/" {
  policy = "write"
}
session "" {
  policy = "write"
}
service "" {
  policy = "write"
}

The token it receives is a temporary one generated by Vault. So maybe there is a correlation with the thing you said about anti-entropy. @banks feel free to write/chat/email/call me for further details, because right now it's a bit of a deal breaker. @jippi I know you're running resec too, but without ACLs - have you seen stuck services in the cluster?
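
For context, a minimal sketch of how such a short-lived token is typically issued from Vault's Consul secrets engine (the role name, TTL, and policy wiring are assumptions, not our exact setup):

    # Enable the Consul secrets engine and define a role; older Vault/Consul
    # versions take a base64-encoded "policy" instead of named "policies".
    vault secrets enable consul
    vault write consul/roles/resec policies=resec ttl=1h
    # Each read returns a fresh Consul ACL token that Vault revokes when the lease expires.
    vault read consul/creds/resec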

burdandrei commented 6 years ago

UPD: after updating ALL the instances in the cluster to 1.2.0 and then draining and terminating all the instances that had unsynced service checks, I can say that for the last 3 hours everything looks OK. In the monitoring we can see that we had a couple of issues with the consul leader running at ~100% CPU, which was caused by https://github.com/hashicorp/vault/issues/4143. After bringing everything back to normal - low CPU, low memory, low network - everything looks OK.

banks commented 6 years ago

Thanks @burdandrei. Sounds like my ACL hunch was right.

This is a known issue (at least I hypothesised it existed several months ago; this is the first time we've seen it bite in the wild) and it has been this way since forever - the 1.2.0 upgrade must have been bad timing.

The issue is that the Vault backend expires tokens (i.e. deletes them from Consul), but the agent you registered with doesn't currently know anything about that, so it just stops being able to perform anti-entropy. The only thing it can do here is log that error.

The "fix" if you want to use Vault correctly (which is not documented anywhere and one of the reasons we are going to be doing a lot of work on ACLs and Vault integration soon) is that whenever Vault expires your ACL token, your app (or whatever registered your app in catalog) needs to re-register itself with the new token.

@danaps I'm going to rename this issue to reflect the real problem @burdandrei has since there has now been a lot of context about his case. Can you confirm whether ACLs being deleted is also the cause of the issue for you?

If not, do you mind opening a separate issue with the same info (including why you think it's not ACLs, and logs from the agent that is "failing") so we can keep track of what you are seeing please?

burdandrei commented 6 years ago

@banks WOW! We will need to rethink how to run resec now. Sounds like a static ACL could be a solution. And @dpascal and I are talking about the same issue - we work at the same company and I asked her to open the issue.

burdandrei commented 5 years ago

Sorry for necroposting @banks, but should this still be an issue in 1.4, or can we start using dynamic tokens for registering/deregistering services?

banks commented 5 years ago

@burdandrei the underlying issue here is the same in 1.4. There's no real way to fix it at a fundamental level - if we did something magic and auto-rotated tokens so that it never fails, it would effectively defeat the security benefit you get from rotating, since an attacker could use the same mechanism to prolong the life of their stolen token too.

So we prefer to keep it explicit. One thing we've not done yet is fix the docs to make this more clear. There is work going on due to be released in the next month or two that has the potential to help with this issue, but ultimately it still needs the application (or some other tooling) to actively manage the rotation which means re-registering with the new token when it rotates. In some common cases we hope to eventually provide tooling that can be used to do that so you don't have to solve the issue by hand, but it will necessarily be one opinionated way to run things and won't fit everyone's workflow. Better docs on how to do this yourself will come as part of that though.

The short answer is that 1.4.0 ACLs don't really change this: you still need to have something re-register the application instance with its local agent with a new token before the old one is revoked.

jippi commented 5 years ago

@banks Thanks for the answer. Is this something you and the Nomad team are working on improving as well? Based on what @burdandrei told me, Nomad does not handle this - either at all, or not nicely/seamlessly.

burdandrei commented 5 years ago

Thanks for the answer @banks, we're just thinking about the right way to distribute consul tokens to nomad clients. Do I understand correctly that if we generate a token per nomad client and it's renewed throughout the instance's lifecycle, we're in a good place, because the token is scoped to the consul agent where it registers the services?

banks commented 5 years ago

Nomad currently only registers services at the start of a task and removes them during termination. That means if the token rotates in the interim, Nomad won't currently re-register for you, leaving the Consul agent unable to make further changes to the registration.

The good news is that we are working with the Nomad team to improve the overall integration and the registration and token UX will be part of that.

danlsgiga commented 5 years ago

I'm facing the very same issue after I integrated Vault to generate Consul ACL tokens on the fly. In our case, we update the acl_default_token via the CLI command consul acl set-agent-token default xxxx. But after Vault rotates the Consul ACL and we inject the new Consul ACL token via the method above, Nomad calls start to fail with the same errors and we start to see duplicate entries in the Consul catalog, since Nomad was unable to deregister the previous task allocation's health checks in Consul:

[ERR] consul: "Catalog.Deregister" RPC failed to server 1.2.3.4:8300: rpc error making call: rpc error making call: ACL not found

Restarting the consul agent seems to solve the issue until our out-of-band process generates new Consul ACL tokens via Vault.

Update: restarting the consul agent is not sufficient in this case... we have to restart the nomad client as well.
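
For reference, a minimal sketch of the rotation step described above (the Vault role name is an assumption); as discussed earlier in the thread, services already registered with the old token can still fail to sync until they are re-registered:

    # Fetch a new token from Vault and install it as the agent's default token.
    NEW_TOKEN=$(vault read -field=token consul/creds/nomad-agent)
    consul acl set-agent-token default "${NEW_TOKEN}"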

danlsgiga commented 5 years ago

I've done some tests to validate the Nomad behaviour once a new ACL token is injected into the Consul agent and the previous one is revoked, and I noticed the following:

  • After the ACL tokens are recycled, the first call for updates by Nomad fails with the ACL not found errors.
  • The only side-effects remaining after that are errors like [ERR] consul: "Catalog.Deregister" RPC failed to server 1.2.3.4:8300: rpc error making call: rpc error making call: Unknown check '82501dafc1680911b77851756cc2e6aab4fa084c'
  • The ACL not found errors will only happen when Vault recycles the Consul ACL token again.

So, it seems that Nomad only tries to refresh its "ACL cache" after the first failure and is able to get things sorted after that.

Does anyone else see the same behaviour in your environments? @banks would that be a possible nomad behaviour in this scenario?

UPDATE: That seems to apply to each Nomad job / Service though.

banks commented 5 years ago

@banks would that be a possible nomad behaviour in this scenario?

Did you mean a possible Consul behaviour? I'll assume so, as I think what you describe is accurate for Nomad, to the best of my knowledge.

It's more subtle than that in Consul sadly.

In the Nomad case, Nomad itself is the "application" that is authenticated to Vault and can fetch and renew ACL tokens for Consul, actively or passively on failures as you described.

But Consul can't just renew tokens magically with Vault, since it is not directly authenticated to Vault - the application is, and it pushes the token in.

The only correct way to handle this currently is for the application (or tooling around it like Nomad, another scheduler, a helper daemon, etc.) to actively manage the token in Vault, renew it, and then re-register with Consul.

As I mentioned, we have plans to expand the new auth method system in 1.5.0 to provide a full alternative workflow that works differently to side-step this problem and allow rotation etc., making this much easier over the next few months. But I don't think there are any quick fixes to the existing ACL flow that will make this just work - in the current model it needs to be driven, externally to Consul, by the application that is interacting with Vault.

For example, one could write some external tooling - perhaps even using consul-template - that actively manages the token via Vault on behalf of your app and then pushes the updated registration with the new token on each change. That could be deployed as a "sidecar" alongside app instances and make this all work today. The eventual solution will probably look a lot like this anyway (possibly built into the consul binary rather than separate tooling), although it will simultaneously solve the issue of managing tokens and policies more simply even without Vault.
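
A minimal sketch of that sidecar idea using consul-template (the file paths, Vault role name, and service definition are assumptions):

    # token.tpl - renders the Vault-issued Consul token to a file
    {{ with secret "consul/creds/my-role" }}{{ .Data.token }}{{ end }}

    # reregister.sh - re-registers the service with the freshly rendered token
    # so the agent's anti-entropy sync keeps using a valid token
    curl --request PUT \
      --header "X-Consul-Token: $(cat consul.token)" \
      --data '{"Name": "my-service", "Port": 8080}' \
      http://localhost:8500/v1/agent/service/register

    # Run consul-template: every rotation re-renders the token file and runs the command.
    consul-template -template "token.tpl:consul.token:./reregister.sh"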

Does that help?

danlsgiga commented 5 years ago

Sure does... looking forward to the enhancements to come. Right now we have a single systemd job handling these issues on a schedule, so we should be covered for now.

Thanks for the detailed feedback.

elcomtik commented 2 years ago

The "fix" if you want to use Vault correctly (which is not documented anywhere and one of the reasons we are going to be doing a lot of work on ACLs and Vault integration soon) is that whenever Vault expires your ACL token, your app (or whatever registered your app in catalog) needs to re-register itself with the new token.

@banks any news about the documentation of proper usage? We have several clusters, which are experiencing this issue occasionally.

Our setup consists of the cluster with vault, consul, and nomad servers. Nomad client nodes are accompanied by vault agent and consul agent. Vault agent templates nomad and consul configuration with ACL tokens. Vault agent actively renews tokens, which seems to work OK. Occasionally it stops syncing service instances to the consul catalog. Nomad client restart seems to fix this, however, I would like to be able to prevent such an issue.

EDIT: added component versions: consul 1.9.4, vault 1.6.2, nomad 1.0.4

schmichael commented 2 years ago

Hi @elcomtik, Nomad engineer here :wave:

Just to clarify: are you saying that Vault agent re-templates Nomad's agent configuration file with a new Consul Token, SIGHUP's Nomad, and sometimes Nomad fails to use the new token when registering Consul services? If so this sounds like a bug with Nomad's SIGHUP handling. Please file an issue over at: https://github.com/hashicorp/nomad/issues/new?assignees=&labels=type%2Fbug&template=bug_report.md

burdandrei commented 2 years ago

@elcomtik correct me if I'm wrong: the Vault agent is running and providing a Consul ACL token for the Nomad agent. The Nomad agent starts and runs as usual, registering services using the provided Consul token. Things go wrong when the Consul ACL token reaches its max TTL and gets rotated, because for Consul to handle this correctly, everything registered with the initial token should be de-registered and re-registered using the new dynamic token. From a technical point of view, the Nomad agent could listen for USR1 or USR2 to handle this specific case, but it's up to you to decide, @schmichael

elcomtik commented 2 years ago

@elcomtik correct me if I'm wrong: the Vault agent is running and providing a Consul ACL token for the Nomad agent. The Nomad agent starts and runs as usual, registering services using the provided Consul token. Things go wrong when the Consul ACL token reaches its max TTL and gets rotated, because for Consul to handle this correctly, everything registered with the initial token should be de-registered and re-registered using the new dynamic token. From a technical point of view, the Nomad agent could listen for USR1 or USR2 to handle this specific case, but it's up to you to decide, @schmichael

Yes, the vault agent is providing an ACL token for the nomad agent. I suppose that it goes wrong at max TTL or on a failed renewal. Then it needs to be rotated by the vault agent and reloaded by the nomad agent.

The Vault agent templates it and does a restart via:

 command     = "systemctl restart nomad-client"

The systemd service is inspired by https://learn.hashicorp.com/tutorials/nomad/production-deployment-guide-vm-with-consul with a little customization.
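
For context, the surrounding Vault Agent template stanza presumably looks something like this (the paths are assumptions; only the command line above is from the actual setup):

    template {
      source      = "/etc/vault-agent.d/consul-token.tpl"
      destination = "/etc/nomad.d/consul-token.hcl"
      command     = "systemctl restart nomad-client"
    }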

So currently I'm sending SIGINT and restarting; @schmichael mentioned it should be SIGHUPed, and you @burdandrei are recommending USR1 or USR2 signals. Are they equivalent, or is one of them better?

burdandrei commented 2 years ago

@elcomtik USR1 and USR2 are not implemented in nomad right now. A Nomad agent restart is not enough because by default it doesn't drain/de-register anything running on the node.

Here's how I see the flow for this:

elcomtik commented 2 years ago

OK, should I try changing systemctl restart to systemctl reload? I might test it and give you feedback when it's done.

elcomtik commented 2 years ago

@elcomtik USR1 and USR2 are not implemented in nomad right now. A Nomad agent restart is not enough because by default it doesn't drain/de-register anything running on the node.

@burdandrei a few questions emerged:

elcomtik commented 2 years ago

@burdandrei I'm just testing this approach with the SIGHUP signal. Unfortunately, I ran into this bug: https://github.com/hashicorp/nomad/issues/3885.

I'm reloading it via the vault-agent service, so I wanted to patch this by defining the systemd service order. However, this is useless because the systemd unit for nomad uses Type=simple, which doesn't wait for service startup. I tried the types forking and notify, but nomad doesn't support them either.

There is an issue aiming for a similar feature, https://github.com/hashicorp/nomad/issues/4466, but it seems to have been stale for a long time.

I will patch this with some sleep before starting vault-agent; however, I will raise another issue in the nomad repo to address this.

elcomtik commented 2 years ago

For anyone interested, I resolved the issue of Nomad not supporting the systemd service types forking or notify by using Restart=always instead of Restart=on-failure. Not clean, but it works for me.
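
For anyone wanting to replicate it, a minimal sketch of that workaround (the unit name is an assumption) using a systemd drop-in rather than editing the unit file directly:

    # Override only the restart policy so nomad comes back even after a clean stop.
    sudo mkdir -p /etc/systemd/system/nomad-client.service.d
    printf '[Service]\nRestart=always\n' | sudo tee /etc/systemd/system/nomad-client.service.d/restart.conf
    sudo systemctl daemon-reload
    sudo systemctl restart nomad-client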

ninjadude333 commented 1 year ago

hi, any updates ?

jkirschner-hashicorp commented 1 year ago

I believe this may be fixed by https://github.com/hashicorp/consul/pull/16097, which will be included in the upcoming Consul 1.16 release.

CC @pglass

pglass commented 1 year ago

https://github.com/hashicorp/consul/pull/16097 was included in Consul 1.15, which is released. That change helps with failed deregistrations of checks and services, which now use the agent token (always) so deregistration should "just work" if the agent token is valid.

However, that change doesn't help with token expirations causing failed registrations and updates of checks and services. The agent remembers the token used to register a check (or service). After that token expires, subsequent anti-entropy updates to the Consul servers for that check (or service) will fail due the expired token. (The local Consul agent will know the state of the check, but that won't be reflected in the Consul catalog.)