hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Session with empty `NodeChecks` created from Consul Server is immediately lost if Consul Server loses network #20424

Open Vanav opened 7 months ago

Vanav commented 7 months ago

Overview of the Issue

If a session is created from a Consul Server with empty `NodeChecks` and that Consul Server loses network connectivity, the session is deleted immediately. The issue does not occur if the session is created from a Consul Client.

Details

Session documentation:

The contract that Consul provides is that under any of the following situations,
the session will be invalidated:
- Node is deregistered
- Any of the health checks are deregistered
- Any of the health checks go to the critical state
- Session is explicitly destroyed
- TTL expires, if applicable

None of these conditions applies: the node is not deregistered yet (there is only a short network outage of ~5 s), and no health check bound to the session has failed (the default `serfHealth` check is not attached, because `NodeChecks` is empty).
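To see which checks Consul currently reports for the node, the documented `/v1/health/node/<node>` endpoint can be queried from a surviving server. The following is only a rough sketch: the helper names and the `CONSUL_ADDR` variable are made up for illustration, and the grep-based matching assumes the JSON is not reformatted beyond optional whitespace.

```shell
# Illustrative helpers; names and CONSUL_ADDR are assumptions, not part of Consul.
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"

# Fetch all health checks registered for a node.
node_checks() {
  curl -s "$CONSUL_ADDR/v1/health/node/$1"
}

# Succeed if the JSON read from stdin contains a check in the critical state.
has_critical_check() {
  grep -q '"Status"[[:space:]]*:[[:space:]]*"critical"'
}
```

For example, `node_checks db4 | has_critical_check && echo "critical check on db4"` shows whether any check on the node is critical. A session created with empty `NodeChecks` is not bound to any of those checks.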

Also, this works as documented when the test is run from a Consul Client: the session is not lost.

This case matters when building the "Application leader election with sessions" pattern on 3 servers, for example with Consul Server and Patroni placed on each of the 3 servers. The expectation is that when the connection is lost, the session is not deleted immediately, but only after the TTL expires.
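For context, the session-based leader election flow can be sketched with Consul's HTTP API: create a session, try to acquire a key with it, and renew the session before the TTL runs out. This is a minimal sketch, not how Patroni implements it. The function names, the `CONSUL_ADDR` variable, and the key name in the usage note are assumptions made for illustration; the endpoints are Consul's documented session and KV endpoints.

```shell
# Sketch only: helper names and CONSUL_ADDR are illustrative assumptions.
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"

# JSON body for a session with no node checks and the given TTL.
session_payload() {
  printf '{"Name": "%s", "NodeChecks": [], "TTL": "%s"}' "$1" "$2"
}

# Extract the session ID from a {"ID":"..."} response read from stdin.
session_id_from_response() {
  sed -n 's/.*"ID"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Create a session (name, TTL) and print its ID.
create_session() {
  curl -s -X PUT -d "$(session_payload "$1" "$2")" \
    "$CONSUL_ADDR/v1/session/create" | session_id_from_response
}

# Try to take the leader key (key, session ID, value); Consul answers true or false.
acquire_leader_key() {
  curl -s -X PUT -d "$3" "$CONSUL_ADDR/v1/kv/$1?acquire=$2"
}

# Renew the session (session ID) so that its TTL does not expire.
renew_session() {
  curl -s -X PUT "$CONSUL_ADDR/v1/session/renew/$1" > /dev/null
}
```

A candidate would run something like `sid=$(create_session testlock 5m)`, then `acquire_leader_key service/leader "$sid" "$(hostname)"`, and call `renew_session "$sid"` periodically while it holds the key.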

Why does the behavior differ when the session is created from a Consul Server? How can this be changed so that the specified session TTL is respected?


Reproduction Steps

  1. Create a cluster with 3 Consul Servers.
  2. Create a session from Consul Server 1:

    curl -X PUT -d '{"Name": "testlock", "NodeChecks": [], "TTL": "5m"}' http://localhost:8500/v1/session/create

    {"ID":"db870c93-b8d8-688a-a763-e259f0409be3"}

  3. Check session:

    Thu 1 Feb 00:20:57 2024

    curl -s http://localhost:8500/v1/session/info/db870c93-b8d8-688a-a763-e259f0409be3

    [{"ID":"db870c93-b8d8-688a-a763-e259f0409be3","Name":"testlock","Node":"db4", "LockDelay":15000000000,"Behavior":"release","TTL":"5m", "NodeChecks":[],"ServiceChecks":null,"CreateIndex":11465749, "ModifyIndex":11465749}]

  4. Disconnect [network](https://developer.hashicorp.com/consul/docs/install/ports) on Consul Server 1:

    Thu 1 Feb 00:21:04 2024

    PORTS="8300,8301,8302,8500,8503,8600" && \
    iptables-legacy -A INPUT -p tcp -m multiport --dports $PORTS -j DROP && \
    iptables-legacy -A INPUT -p udp -m multiport --dports $PORTS -j DROP && \
    iptables-legacy -A OUTPUT -p tcp -m multiport --dports $PORTS -j DROP && \
    iptables-legacy -A OUTPUT -p udp -m multiport --dports $PORTS -j DROP

  5. On Consul Server 2, check the session. It is lost immediately:

    Thu 1 Feb 00:21:22 2024

    curl -s http://localhost:8500/v1/session/info/db870c93-b8d8-688a-a763-e259f0409be3

    []
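To measure how long the session actually survives after the network is cut, the check in the last step can be put in a loop on another server. This is a minimal sketch: the function names, the `CONSUL_ADDR` variable, and the one-second polling interval are assumptions for illustration, and an empty JSON array is treated as "session gone" because that is what the reproduction shows.

```shell
# Sketch only: helper names and CONSUL_ADDR are illustrative assumptions.
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"

# Succeed if a /v1/session/info response body is an empty JSON array.
session_is_gone() {
  [ "$(printf '%s' "$1" | tr -d '[:space:]')" = "[]" ]
}

# Poll once per second until the session disappears, then print the elapsed time.
# Gives up after max_wait seconds (default 300) and returns 1.
watch_session() {
  local id="$1" max_wait="${2:-300}" start body
  start=$(date +%s)
  while [ $(( $(date +%s) - start )) -lt "$max_wait" ]; do
    body=$(curl -s "$CONSUL_ADDR/v1/session/info/$id")
    if session_is_gone "$body"; then
      echo "session gone after $(( $(date +%s) - start )) s"
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Starting `watch_session db870c93-b8d8-688a-a763-e259f0409be3` on Consul Server 2 right before applying the iptables rules gives a number to compare against the 5m TTL.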


### Consul info for both Client and Server

<details>
  <summary>Server info</summary>

    consul info

    agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
    build:
        prerelease =
        revision = 7736539d
        version = 1.17.2
        version_metadata =
    consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = xxx:8300
        server = true
    raft:
        applied_index = 11466634
        commit_index = 11466634
        fsm_pending = 0
        last_contact = 29.711728ms
        last_log_index = 11466634
        last_log_term = 5883
        last_snapshot_index = 11464290
        last_snapshot_term = 5836
        latest_configuration = [{Suffrage:Voter ID:eb490abe-4e5b-b953-154e-399bbd643be8 Address:xxxx:8300} {Suffrage:Voter ID:21e95163-2318-e3db-c9da-b0a621a1b3de Address:xxxx:8300} {Suffrage:Voter ID:913ffc1e-68c1-481f-ca7e-bd55d90ed79b Address:xxxx:8300} {Suffrage:Voter ID:54b5021d-2f7c-cd07-7da8-f4ad2dbb406a Address:xxxx:8300}]
        latest_configuration_index = 0
        num_peers = 3
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 5883
    runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 161
        max_procs = 4
        os = linux
        version = go1.21.6
    serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 227
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 21614
        members = 6
        query_queue = 0
        query_time = 1
    serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 7598
        members = 5
        query_queue = 0
        query_time = 1


Client agent HCL config:

    data_dir = "/var/lib/consul"
    server = true
    bootstrap_expect = 3
    bind_addr = "xxx"
    retry_join = [
      "xxx",
      "xxx",
      "xxx",
    ]
    encrypt = "xxx"

</details>

### Operating system and Environment details

Ubuntu 22 LTS

### Log Fragments

<details>

    Feb 01 00:21:05 db3 consul[3199803]: 2024-02-01T00:21:05.978+0300 [WARN] agent: error getting server health from server: server=db4 error="context deadline exceeded"
    Feb 01 00:21:07 db3 consul[3199803]: 2024-02-01T00:21:07.978+0300 [WARN] agent: error getting server health from server: server=db4 error="context deadline exceeded"
    Feb 01 00:21:09 db3 consul[3199803]: 2024-02-01T00:21:09.978+0300 [WARN] agent: error getting server health from server: server=db4 error="context deadline exceeded"
    Feb 01 00:21:10 db3 consul[3199803]: 2024-02-01T00:21:10.116+0300 [INFO] agent.server.memberlist.lan: memberlist: Marking db4 as failed, suspect timeout reached (2 peer confirmations)
    Feb 01 00:21:10 db3 consul[3199803]: 2024-02-01T00:21:10.117+0300 [INFO] agent.server.serf.lan: serf: EventMemberFailed: db4 xxx
    Feb 01 00:21:10 db3 consul[3199803]: 2024-02-01T00:21:10.117+0300 [INFO] agent.server: Removing LAN server: server="db4 (Addr: tcp/xxx:8300) (DC: dc1)"
    Feb 01 00:21:11 db3 consul[3199803]: 2024-02-01T00:21:11.978+0300 [WARN] agent: error getting server health from server: server=db4 error="context deadline exceeded"
    Feb 01 00:21:13 db3 consul[3199803]: 2024-02-01T00:21:13.978+0300 [WARN] agent: error getting server health from server: server=db4 error="context deadline exceeded"
    Feb 01 00:21:15 db3 consul[3199803]: 2024-02-01T00:21:15.219+0300 [INFO] agent.server.serf.lan: serf: EventMemberLeave (forced): db4 xxx
    Feb 01 00:21:15 db3 consul[3199803]: 2024-02-01T00:21:15.219+0300 [INFO] agent.server: Removing LAN server: server="db4 (Addr: tcp/xxx:8300) (DC: dc1)"
    Feb 01 00:21:15 db3 consul[3199803]: 2024-02-01T00:21:15.979+0300 [WARN] agent: error getting server health from server: server=db4 error="context deadline exceeded"


</details>
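To line the log up with the reproduction steps, it helps to pull out the timestamps of the membership events (`Marking db4 as failed`, `EventMemberLeave (forced)`) and compare them with the time the iptables rules were applied. The helper below is a small sketch with a made-up name; it assumes one log entry per line and the timestamp format shown in the fragment.

```shell
# Print the timestamp (without zone offset) of the first line on stdin
# that contains the given fixed string. Helper name is illustrative.
first_event_time() {
  grep -F -m1 -- "$1" | grep -o '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9:.]*' | head -n1
}
```

For example, with the log saved to a file (here a hypothetical `consul.log`), `first_event_time 'Marking db4 as failed' < consul.log` and `first_event_time 'EventMemberLeave (forced)' < consul.log` give the two points in time to compare with the moment the session disappeared.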

### Other discussions of this issue

<details>

- [failover not paying attention ttl · Issue #2868 · zalando/patroni](https://github.com/zalando/patroni/issues/2868#issuecomment-1722926442)
- [Consul: lock is lost on 7-10 s network issues · Issue #522 · zalando/patroni](https://github.com/zalando/patroni/issues/522)
- [Increase timeout before node is deleted and its locks are lost](https://groups.google.com/g/consul-tool/c/YMa4J40qgM0)
</details>

pratik-ja commented 2 months ago

@Vanav I am facing this even when I create a session from a Consul agent. Has anyone found a workaround?