hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul Node stuck at Leaving status #6882

Open anshitabharti opened 4 years ago

anshitabharti commented 4 years ago

Overview of the issue:

  1. Sometimes one of the nodes' SerfStatus gets stuck in the leaving state. Even though the agent is initially started with retry-join, once it falls out of the cluster it is unable to join back. When the node drops out of the cluster, the container is still up and running. To work around this, the container has to be restarted manually, which we want to avoid.

  2. Even if just one of the nodes is in Leaving status, the monitoring API v1/operator/autopilot/health responds with Healthy: false, even though all K/V operations can be executed without any issues. Because of Healthy: false, the alerts kick in and create panic as if the cluster were actually unhealthy. What is the rationale behind considering the cluster unhealthy? (See the sketch of this check below.)
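A minimal sketch of that health check using the official Go API client (github.com/hashicorp/consul/api), assuming a local agent on the default address; the printed fields mirror what /v1/operator/autopilot/health returns:

```go
// Sketch: query autopilot server health, the same data behind
// GET /v1/operator/autopilot/health. Assumes a local agent on the
// default address and, if ACLs are enabled, a token in the environment.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Note: the HTTP endpoint replies 429 when the cluster is unhealthy,
	// which some client versions surface as an error here.
	health, err := client.Operator().AutopilotServerHealth(nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("cluster healthy: %v (failure tolerance: %d)\n",
		health.Healthy, health.FailureTolerance)
	for _, s := range health.Servers {
		// A single server reporting SerfStatus "leaving" flips Healthy to
		// false, even while K/V operations still succeed.
		fmt.Printf("  %s serf=%s healthy=%v voter=%v\n",
			s.Name, s.SerfStatus, s.Healthy, s.Voter)
	}
}
```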

Consul version: 1.5.3, running inside Docker containers on OpenStack VMs.

KalenWessel commented 4 years ago

I've been having the same issue with Consul 1.6.2 running on Kubernetes. I can do a rolling redeploy via a StatefulSet update, and sometimes one of the nodes will show SerfStatus leaving and autopilot reports unhealthy. Like you said, only after I delete the container manually does it come back up as healthy again. Did you ever figure out what the problem was?
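For reference, a member stuck in leaving can also be spotted programmatically during a rolling redeploy. A minimal sketch with the Go API client, assuming a local agent on the default address (the check itself is an illustration, not something from this thread):

```go
// Sketch: list LAN members as seen by the local agent and flag any member
// whose serf status is "leaving".
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
	"github.com/hashicorp/serf/serf"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LAN members (wan=false); AgentMember.Status carries the raw serf
	// member status code.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range members {
		if serf.MemberStatus(m.Status) == serf.StatusLeaving {
			fmt.Printf("member %s (%s) is in the leaving state\n", m.Name, m.Addr)
		}
	}
}
```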

lwei-wish commented 2 years ago

We are having the same issue: one of the Consul members becomes leaving when the underlying node is terminated. I have to kill the pod manually to bring it back to alive.

mssawant commented 2 years ago

I am facing a similar issue, where one of the Consul clients sees another as leaving while the latter is alive.

Amier3 commented 2 years ago

Hey @lwei-wish & @mssawant

May I ask which version(s) of Consul y'all are running? It would also be helpful to see any logs if you have them.

mssawant commented 2 years ago

Hi @Amier3, I am running version 1.9.1. Whenever I delete a pod running the Consul client agent, on restart it fails to resolve the node name to the new IP address, and all the other nodes see the restarted pod as failed. We have included leave_on_terminate in the configuration, but the agent still does not seem to leave the cluster.

```json
{
  "enable_local_script_checks": true,
  "leave_on_terminate": true,
```

Any help will be appreciated.
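One possible manual workaround to avoid restarting the pod, sketched with the Go API client, is to force-leave the stale member (equivalent to `consul force-leave <node>`). Whether this addresses the node-name/IP mismatch described above is an assumption, and the node name below is a placeholder:

```go
// Sketch: force-leave a stale member so it moves to the "left" state
// instead of lingering as leaving/failed. Node name is a placeholder.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to running: consul force-leave consul-client-0
	if err := client.Agent().ForceLeave("consul-client-0"); err != nil {
		log.Fatal(err)
	}
	log.Println("force-leave issued for consul-client-0")
}
```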

anshitabharti commented 2 years ago

Hello!

It has been a while, and I do not recall exactly which part helped us solve the problem. I'm pasting the docker-compose and Consul config below in case it helps.

compose:

```yaml
version: '2'
services:
  {{ workload }}:
    network_mode: host
    build:
      args:
```

config:

{ "datacenter": "{{ consul_datacenter }}", "bootstrap_expect": {{ bootstrap_expect }}, "advertise_addr": "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}", "client_addr": "{{ client_addr }}", "server": {{ is_server }}, "data_dir": "{{ consul_data_dir }}", "retry_join": [ "{{ groups[consul_host_group] | map('extract', hostvars, ['ansible_host']) | join("\", \"") }}" ], "encrypt": "{{ consul_encrypt }}", "log_level": "{{ consul_log_level }}", "enable_syslog": {{ consul_enable_syslog }}, "check_update_interval": "{{ consul_check_interval }}", "acl_datacenter":"{{ consul_datacenter }}", "acl_default_policy":"{{ acl_policy }}", "acl_down_policy":"{{ acl_down_policy }}", "acl_master_token":"{{ acl_master_token }}", "acl_agent_token": "{{ acl_agent_token }}", "performance": { "raft_multiplier": {{ raft_multiplier }} }, "gossip_lan": { "probe_timeout": "{{ probe_timeout }}", "probe_interval": "{{ probe_interval }}" } }

mssawant commented 2 years ago

Thanks @anshitabharti, I thought advertise_addr would help, but no luck. Trying probe_timeout and probe_interval now.

chymy commented 2 years ago

We are having the same issue: one of the Consul clients sees another as leaving while the latter is alive. Consul version: 1.6.2