hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Auto-config node attempted to re-auto-config unsuccessfully #12849

Open mr-miles opened 2 years ago

mr-miles commented 2 years ago

Overview of the Issue

An agent node was set up with auto-config and successfully joined the cluster. It ran fine for more than a month.

At some point a networking issue disconnected it from the servers. When it tried to reconnect it was unable to, and it then appeared to run its auto-config steps again; that failed because its auto-config token had expired.

It's not clear from our logs what is due to operator reboots and restarts and what is the agent's own behaviour. Can you clarify what the auto-config behaviour is after a disconnection, and also when certificates or other target details have changed in the interim? Happy to update the documentation pages accordingly (the auto-config docs are quite light IMO).

One further specific question: we run the Consul servers in an AWS auto-scaling group and use the discovery-by-tag features to look up the cluster, which means the IP addresses of the server nodes change. I noticed that the auto-config.json file in the data directory lists the explicit IPs of the server nodes under RetryJoinLAN, rather than the auto-discovery details. Is it possible those are stale, or are they not relevant at that point? Should both auto_config.server_addresses and retry_join config entries be specified in the main config file?
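For reference, my understanding (and these values are placeholders for illustration, not from our deployment) is that cloud auto-join strings can be used in both places, so neither stanza ever pins a hard-coded server IP:

```json
{
  "auto_config": {
    "enabled": true,
    "intro_token_file": "",
    "server_addresses": [
      "provider=aws tag_key=consul-server tag_value=dc1"
    ]
  },
  "retry_join": [
    "provider=aws tag_key=consul-server tag_value=dc1"
  ]
}
```

It would be good to confirm whether retry_join configured this way takes precedence over whatever RetryJoinLAN addresses were persisted in auto-config.json.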

Consul info for both Client and Server

Client and server are both 1.11.3. Servers are running on Linux (Ubuntu 20.04); the client is running on Windows Server 2019.

```json
{
  "ca_file": "",
  "bind_addr": "0.0.0.0",
  "disable_update_check": true,
  "ports": {
    "https": 8501,
    "http": -1
  },
  "auto_config": {
    "intro_token_file": "",
    "server_addresses": [
      "provider=aws tag_key= tag_value="
    ],
    "enabled": true
  },
  "verify_server_hostname": false,
  "verify_incoming": false,
  "log_file": "",
  "node_name": "",
  "verify_incoming_rpc": true,
  "server": false,
  "client_addr": "127.0.0.1",
  "verify_outgoing": true,
  "advertise_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "datacenter": "",
  "data_dir": "",
  "ui_config": {
    "enabled": false
  }
}
```

Amier3 commented 2 years ago

Hey @mr-miles

Hope you're doing well!

I feel your pain on the lack of documentation around this. To the first part of your issue (the unsuccessful re-join): how are you managing the JWT? Was it generated manually or through something like Vault? From my understanding, auto-config is meant to be paired with a secrets manager that can refresh the JWT, so if you still ended up with an expired token you might've encountered a bug or a misconfigured Vault token.
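For context, and hedging since I haven't verified this against your setup: the servers validate the intro JWT against a static authorization stanza, so expiry and audience problems usually trace back to that config or to the token's TTL at the issuer. A minimal sketch of the server side (all URLs, audiences, and claim names below are hypothetical) might look like:

```json
{
  "auto_config": {
    "authorization": {
      "enabled": true,
      "static": {
        "oidc_discovery_url": "https://vault.example.com/v1/identity/oidc",
        "bound_audiences": ["consul-cluster"],
        "claim_mappings": {
          "node_name": "node"
        },
        "claim_assertions": [
          "value.node == \"${node}\""
        ]
      }
    }
  }
}
```

If the issuer side mints tokens with a short TTL, the agent only has that window to complete its initial auto-config handshake.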

To your second question, I'll have to ask around on the team and get more information on the behavior between auto-config and dynamic IPs in general.

mr-miles commented 2 years ago

Thanks - we generated the JWT on first boot via Vault using EC2 credentials, but it has a short expiry as it's used immediately to join the agent to the cluster. My reading of the docs was that once the agent joins and has retrieved the config, it has no further need of the JWT because it doesn't need to download the config again - is that not correct?