anshitabharti opened this issue 4 years ago
I've been having the same issue with Consul 1.6.2 running on k8s. When I do a rolling redeploy via a StatefulSet update, sometimes one of the nodes shows SerfStatus leaving and autopilot reports unhealthy. Like you said, only after I delete the container manually does it come back up as healthy again. Did you ever figure out what the problem was?
We are having the same issue: one of the Consul members becomes leaving when the underlying node is terminated. I have to kill the pod manually to bring it back to alive.
I am facing a similar issue, where one of the Consul clients sees another as leaving while the latter is alive.
Hey @lwei-wish & @mssawant
May I ask which version(s) of Consul y'all are running? It would also be helpful to see any logs if you have them.
Hi @Amier3, I am running version 1.9.1. Whenever I delete a pod running a Consul client agent, on restart it just fails to resolve the node name to the new IP address, and all the other nodes see the restarted pod as failed.
We have included `leave_on_terminate` in the configuration, but the agent still does not seem to leave the cluster:

```json
{
  "enable_local_script_checks": true,
  "leave_on_terminate": true
}
```

Any help will be appreciated.
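In case it helps while debugging, here is a sketch of how one might inspect and clear the stale entry with the stock Consul CLI; this is a workaround rather than a fix, and the node name is a placeholder.

```sh
# List members as seen by this agent; the affected node shows up as "leaving".
consul members

# Force the stale member out of the gossip pool so the restarted pod can rejoin
# under the same node name. "consul-client-2" is a placeholder node name.
consul force-leave consul-client-2
```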
Hello!
It has been a while and I don't recall exactly which part solved the problem. I'm pasting the docker-compose and Consul config below in case that helps.
compose:

```yaml
version: '2'
services:
  {{ workload }}:
    network_mode: host
    build:
      args:
        - UID={{ ansible_user_uid }}
        - GID={{ ansible_user_gid }}
      context: .
    image: "{{ image_tag }}"
    container_name: {{ container_name }}
    hostname: "{{ ansible_host }}"
    ports:
```

config:

```json
{
  "datacenter": "{{ consul_datacenter }}",
  "bootstrap_expect": {{ bootstrap_expect }},
  "advertise_addr": "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}",
  "client_addr": "{{ client_addr }}",
  "server": {{ is_server }},
  "data_dir": "{{ consul_data_dir }}",
  "retry_join": [
    "{{ groups[consul_host_group] | map('extract', hostvars, ['ansible_host']) | join("\", \"") }}"
  ],
  "encrypt": "{{ consul_encrypt }}",
  "log_level": "{{ consul_log_level }}",
  "enable_syslog": {{ consul_enable_syslog }},
  "check_update_interval": "{{ consul_check_interval }}",
  "acl_datacenter": "{{ consul_datacenter }}",
  "acl_default_policy": "{{ acl_policy }}",
  "acl_down_policy": "{{ acl_down_policy }}",
  "acl_master_token": "{{ acl_master_token }}",
  "acl_agent_token": "{{ acl_agent_token }}",
  "performance": {
    "raft_multiplier": {{ raft_multiplier }}
  },
  "gossip_lan": {
    "probe_timeout": "{{ probe_timeout }}",
    "probe_interval": "{{ probe_interval }}"
  }
}
```
Thanks @anshitabharti, thought `advertise` would help but no luck. Trying `probe_timeout`, `probe_interval`.
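For reference, a minimal sketch of what that gossip tuning could look like as a drop-in config fragment; the timeout values are illustrative examples, not recommendations from this thread.

```sh
# Write an example LAN gossip tuning fragment into the agent's config dir and
# restart the agent so it takes effect. The values and path are placeholders.
cat > /etc/consul.d/gossip-tuning.json <<'EOF'
{
  "gossip_lan": {
    "probe_timeout": "3s",
    "probe_interval": "5s"
  }
}
EOF
```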
We are having the same issue, with one of the Consul clients seeing another as leaving while the latter is alive. Consul version: 1.6.2
Overview of the issue:
Sometimes one of the nodes' SerfStatus gets stuck in the leaving state. Even though the agent is initially started with retry-join, if it falls out of the cluster it is unable to join back. When the node drops out of the cluster, the container is still up and running. To resolve this, the container has to be restarted manually, which we want to avoid.
Even if just one node is in Leaving status, the monitoring API v1/operator/autopilot/health responds with Healthy: false, even though all K/V operations can be executed without any issues. Because of Healthy: false, the alerts kick in and create panic as if the cluster were actually unhealthy. What's the rationale behind considering the cluster unhealthy?
Consul version: 1.5.3, running inside Docker containers on OpenStack VMs.
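For anyone triaging the same alert noise, the endpoint can be queried directly to see which server Autopilot flags; each entry includes a SerfStatus field, so the member stuck in leaving stands out. The agent address below is an example and jq is assumed to be installed.

```sh
# Query Autopilot's view of server health. 127.0.0.1:8500 is an example agent
# address; jq is only used to trim the response down to the relevant fields.
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health \
  | jq '{Healthy, Servers: [.Servers[] | {Name, SerfStatus, Healthy}]}'
```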