hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Reopen #6998 - Consul agent not responding for a few minutes #7536

Open cshabi opened 4 years ago

cshabi commented 4 years ago

Overview of the Issue

In our Kubernetes cluster, each node runs a Consul agent, a registrator container, and some pods running production services. On some nodes, at the beginning of each hour, there is a spike in disk utilisation for about 5-10 minutes because files are zipped and sent to other nodes.

In some cases, disk utilisation reaches 100%, and during that time Consul does not respond at localhost:8500. We monitor the Consul agent at this endpoint every 5 seconds: /v1/agent/services. During the utilisation spike the curl command returns 128 because of the connect timeout we set (3 seconds). The problem resolves by itself once disk utilisation drops.
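The monitoring described above could be reproduced with something like the following. This is only a minimal sketch, not the reporter's actual check (which uses curl): the names `probe_agent` and `monitor` are hypothetical, and unlike curl's connect timeout, the `timeout` passed to `urllib` applies to each blocking socket operation, so it also bounds the wait for the response.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request

# Endpoint quoted in the report; adjust if the agent listens elsewhere.
AGENT_URL = "http://localhost:8500/v1/agent/services"

# Ignore proxy settings from the environment: the agent is on the same host.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_agent(url=AGENT_URL, timeout=3.0):
    """Probe the agent once and return (status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            resp.read()
            status = "ok" if resp.status == 200 else "http_%d" % resp.status
    except urllib.error.HTTPError as exc:
        # The agent answered, but with an error status code.
        status = "http_%d" % exc.code
    except urllib.error.URLError as exc:
        # Connection-phase failures arrive wrapped in URLError.
        timed_out = isinstance(exc.reason, (socket.timeout, TimeoutError))
        status = "timeout" if timed_out else "unreachable"
    except (socket.timeout, TimeoutError):
        # Timeouts while waiting for the response are raised unwrapped.
        status = "timeout"
    except OSError:
        status = "unreachable"
    except http.client.HTTPException:
        status = "bad_response"
    return status, time.monotonic() - start


def monitor(probes, interval=5.0, url=AGENT_URL, timeout=3.0):
    """Run a fixed number of probes, printing one line per probe."""
    results = []
    for i in range(probes):
        status, elapsed = probe_agent(url, timeout)
        results.append(status)
        stamp = time.strftime("%H:%M:%S")
        print("%s status=%s elapsed=%.2fs" % (stamp, status, elapsed))
        if i + 1 < probes:
            time.sleep(interval)
    return results
```

Calling `monitor(12)` would cover about one minute at the 5-second interval from the report. Recording the elapsed time next to the status can help tell a slow agent apart from one that refuses connections outright.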

Operating system and Environment details

Servers and clients in this case are running Ubuntu 16.04. The Consul version is 1.2.1. Disk 0 and Disk 1 are logically set up as one disk (mirror). Consul's data dir is located on this logical disk. Disk #1 is reaching 100% utilisation.

Do you know why this affects the agent?

Additional data

We still experience this issue. The issue starts at 11:04 New Jersey time (which corresponds to 17:04 in the graphs, which are in Israel time).
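The time correlation above can be checked with a short sketch. It assumes fixed offsets for 2020-03-23 (New Jersey on UTC-4, Israel on UTC+2, since Israel had not yet switched to daylight saving time on that date); the function name `to_graph_time` is hypothetical, and the timestamp format is the one used in the agent's log lines.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets valid for 2020-03-23 only; a real tool would use a
# time zone database instead of hard-coding them.
NEW_JERSEY = timezone(timedelta(hours=-4), "EDT")
ISRAEL = timezone(timedelta(hours=2), "IST")


def to_graph_time(log_timestamp):
    """Convert a log timestamp (New Jersey time) to the graphs' Israel time."""
    local = datetime.strptime(log_timestamp, "%Y/%m/%d %H:%M:%S")
    return local.replace(tzinfo=NEW_JERSEY).astimezone(ISRAEL)


print(to_graph_time("2020/03/23 11:04:00").strftime("%H:%M"))  # prints 17:04
```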

Here are logs from the agent (logs are in the New Jersey time zone). These are stdout logs; the stderr log file is empty.

2020/03/23 10:17:08 [ERR] memberlist: Failed fallback ping: write tcp :38523->:8301: i/o timeout
2020/03/23 10:17:54 [ERR] memberlist: Failed fallback ping: write tcp :48699->:8301: i/o timeout
2020/03/23 10:24:42 [WARN] agent: Check "service:-5864fcfc99-lsbwz" HTTP request failed: Get http://:8080//selftest?format=text: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020/03/23 10:24:44 [WARN] agent: Check "service:-5864fcfc99-lsbwz" HTTP request failed: Get http://:8080//selftest?format=text: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020/03/23 11:01:27 [WARN] agent: Check "service:-869b755585-t96s5" HTTP request failed: Get http://:8080//selftest?format=text: dial tcp :8080: connect: connection refused
2020/03/23 11:01:32 [WARN] agent: Check "service:-869b755585-t96s5" HTTP request failed: Get http://:8080//selftest?format=text: dial tcp :8080: connect: connection refused
2020/03/23 11:01:37 [WARN] agent: Check "service:-869b755585-t96s5" HTTP request failed: Get http://:8080//selftest?format=text: dial tcp :8080: connect: connection refused
2020/03/23 11:01:43 [WARN] agent: Check "service:-869b755585-t96s5" is now critical
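To see at a glance which component logs what during the spike, agent log lines like the ones above can be split into fields and counted. This is a rough sketch that assumes every line follows the `date time [LEVEL] component: message` layout seen in this report; `parse_line` and `summarize` are hypothetical helper names, and lines that do not match are skipped.

```python
import re
from collections import Counter

# Layout assumed from the log lines quoted in this report.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<component>\w+): (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with ts, level, component, message, or None if no match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count parsed lines per (level, component) pair."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[(entry["level"], entry["component"])] += 1
    return counts


sample = [
    "2020/03/23 10:17:08 [ERR] memberlist: Failed fallback ping: "
    "write tcp :38523->:8301: i/o timeout",
    '2020/03/23 11:01:43 [WARN] agent: Check "service:-869b755585-t96s5" '
    "is now critical",
]
for key, count in sorted(summarize(sample).items()):
    print(key, count)
```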

And metrics:

[Screenshot: Screen Shot 2020-03-23 at 17 25 10]

pierresouchay commented 4 years ago

Disk usage at 100% is usually deadly for most programs. For example, stdout might block when writing logs (so nothing can be logged, which might block processing), or the program might be unable to write anything and retry indefinitely (the Consul agent writes its state to disk periodically, which might hold a lock somewhere).

In any case, running daemons with a full disk is a recipe for disaster. For instance, ext4 and XFS don't behave the same way when quotas are full: XFS blocks I/O and does not return errors, while ext4 returns an error immediately.

What are your partition types?
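On a Linux host, one way to answer this is to look up the mount that contains Consul's data dir. The sketch below assumes the mount table text comes from `/proc/mounts`; the function name `filesystem_type` and the example path in the usage note are hypothetical, and mount points containing escaped characters (such as `\040` for a space) are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    target = os.path.realpath(path) + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; on ties the later entry
        # wins, because later mounts shadow earlier ones.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

For example, `filesystem_type("/path/to/consul/data", open("/proc/mounts").read())` would report the filesystem type for a data dir at that placeholder path.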

cshabi commented 4 years ago

Hi @pierresouchay

Our fs type is ext4