hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Sporadic error when saving snapshots: unexpected EOF #6893

Open lomori opened 4 years ago

lomori commented 4 years ago


Overview of the Issue

I keep getting sporadic errors when creating backups of our Consul cluster.

$ consul snapshot save /consul/data/20191206120000-1.snap
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF

Reproduction Steps

Just keep creating snapshots and you will see that it sometimes fails.

$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068184
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068185
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068186
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068186
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068194
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068195
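
A minimal loop along these lines (the output path and iteration count are arbitrary; this is just a sketch) can be used to hammer the snapshot command until a failure shows up:

```
# Repeatedly save a snapshot and log any run that fails verification.
# The output path and iteration count are arbitrary example values.
for i in $(seq 1 100); do
  consul snapshot save /consul/data/repro-$i.snap \
    || echo "iteration $i failed" >&2
done
```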

Consul info for both Client and Server

$ consul version
Consul v1.6.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Server info

```
agent:
  check_monitors = 0
  check_ttls = 3
  checks = 3
  services = 3
build:
  prerelease =
  revision = 9be6dfc3
  version = 1.6.1
consul:
  acl = disabled
  bootstrap = false
  known_datacenters = 1
  leader = false
  leader_addr = 10.129.8.88:8300
  server = true
raft:
  applied_index = 1068318
  commit_index = 1068318
  fsm_pending = 0
  last_contact = 10.52169ms
  last_log_index = 1068318
  last_log_term = 332
  last_snapshot_index = 1061622
  last_snapshot_term = 332
  latest_configuration = [{Suffrage:Voter ID:faea4c4e-2b87-83af-b831-9b3fe9b34c58 Address:10.131.12.242:8300} {Suffrage:Voter ID:a1b1991e-8f46-45b6-a09f-3c3cd1155bba Address:10.130.13.199:8300} {Suffrage:Voter ID:3261cc81-307f-c0be-3e7a-e224704bf2a3 Address:10.129.8.88:8300} {Suffrage:Voter ID:a12d5052-dc73-b1af-ed86-313e6c0d038c Address:10.130.6.140:8300} {Suffrage:Voter ID:14e82d1e-a6f2-eba8-d2ff-cc648a5daaac Address:10.129.12.38:8300}]
  latest_configuration_index = 1047983
  num_peers = 4
  protocol_version = 3
  protocol_version_max = 3
  protocol_version_min = 0
  snapshot_version_max = 1
  snapshot_version_min = 0
  state = Follower
  term = 332
runtime:
  arch = amd64
  cpu_count = 4
  goroutines = 92
  max_procs = 4
  os = linux
  version = go1.12.1
serf_lan:
  coordinate_resets = 0
  encrypted = true
  event_queue = 0
  event_time = 19
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 137
  members = 5
  query_queue = 0
  query_time = 1
serf_wan:
  coordinate_resets = 0
  encrypted = true
  event_queue = 0
  event_time = 1
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 193
  members = 5
  query_queue = 0
  query_time = 1
```


benumbed commented 4 years ago

This happens to me on 1.6.2 on arm64. It was also happening on a different cluster on amd64, but since I upgraded that cluster to 1.6.2 it has been quiet on that front.

Errors from the logs of the member which requested the snapshot:

2019/12/07 00:00:02 [ERR] http: Request GET /v1/snapshot, error: failed to stream snapshot: read tcp 172.xx.xx.xx:57649->172.xx.xx.xx:8300: read: connection reset by peer from=127.0.0.1:42446

The cluster leader's logs look normal:

2019/12/07 00:00:02 [INFO] consul.fsm: snapshot created in 182.885µs
2019/12/07 00:00:02 [INFO]  raft: Starting snapshot up to 164365
2019/12/07 00:00:02 [INFO] snapshot: Creating new snapshot at /var/lib/consul/data/raft/snapshots/940-164365-1575705602005.tmp
2019/12/07 00:00:02 [INFO] snapshot: reaping snapshot /var/lib/consul/data/raft/snapshots/933-163151-1575698401708
2019/12/07 00:00:02 [INFO]  raft: Compacting logs from 153522 to 154125
2019/12/07 00:00:02 [INFO]  raft: Snapshot to 164365 complete

Something additional that's interesting: for the failed snapshots, every single one of them has the exact same size, 32871 bytes, a little over a hundred bytes over 32K.

Example:

-rw-r----- 1 consul consul 85640 Dec 6 14:00 20191206-1400.snap
-rw-r----- 1 consul consul 32871 Dec 6 16:00 20191206-1600.snap
-rw-r----- 1 consul consul 85635 Dec 6 18:00 20191206-1800.snap
-rw-r----- 1 consul consul 32871 Dec 6 20:00 20191206-2000.snap
-rw-r----- 1 consul consul 32871 Dec 6 21:00 20191206-2100.snap
-rw-r----- 1 consul consul 32871 Dec 7 00:00 20191207-0000.snap

All the files around 85K are good snapshots; the rest are failures.

This does not seem to happen on the cluster leader, just the follower nodes.
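
For anyone checking their own backup directory, a sketch like this (the directory path is an assumption) separates readable archives from truncated ones:

```
# List exact sizes and verify each archive; a truncated snapshot fails
# to parse. The backup directory is an assumed example path.
ls -l /var/consul-backups/*.snap
for f in /var/consul-backups/*.snap; do
  consul snapshot inspect "$f" > /dev/null || echo "corrupt: $f"
done
```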

danlsgiga commented 4 years ago

Having the same issue on 1.7.1... here's the output of the same command run three times consecutively:

[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Error saving snapshot: Unexpected response code: 500 (failed to decode response: read tcp 192.168.0.1:60444->192.168.0.2:8300: read: connection reset by peer)
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Saved and verified snapshot to index 23436477
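
Since the CLI is a wrapper around the /v1/snapshot HTTP endpoint, a raw request like this (address, port, and token are placeholders) exercises the same path and makes the truncation visible in the reported download size:

```
# Pull a snapshot straight from the HTTP API and print the status code
# and number of bytes received; the address and token are placeholders.
curl -sS -H "X-Consul-Token: my-token" \
  -o consul.snapshot \
  -w 'HTTP %{http_code}, %{size_download} bytes\n' \
  http://127.0.0.1:8500/v1/snapshot
```
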
agaudreault commented 4 years ago

It also happens on Consul v1.6.4

ChipV223 commented 3 years ago

Hi all!

Just wanted to check if anyone has seen this issue when trying to create snapshots since the release of v1.7.3, which includes a pull request that addresses this error.

ChipV223 commented 3 years ago

Hi all!

Closing this, as I just tried to repro this behavior with v1.9.4 by running a watch command that saved a Consul snapshot every 3 seconds for 20 minutes, and did not hit the error. But do feel free to re-open if you see this behavior in any of the more recent Consul releases.
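
For reference, the repro attempt was a watch loop roughly like this (the snapshot path here is just an example):

```
# Rough sketch of the repro attempt described above: save a snapshot
# every 3 seconds and watch for verification errors in the output.
# The target file path is an arbitrary example.
watch -n 3 'consul snapshot save /tmp/repro.snap'
```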

wsams commented 3 years ago

For what it's worth, this is occurring on k8s 1.19 with the consul-helm chart, running Consul 1.9.3:

[ERROR] agent.http: Request error: method=GET url=/v1/snapshot from=10.200.90.8:51322 error="failed to stream snapshot: read tcp 10.200.87.97:48112->10.200.54.76:8300: read: connection reset by peer"
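
For anyone trying to reproduce this on the consul-helm chart, a snapshot can be taken from inside a server pod roughly like this (namespace, pod name, and target path depend on the Helm release and are assumptions):

```
# Example of saving a snapshot from inside a consul-helm server pod;
# namespace, pod name, and path are assumptions and depend on the release.
kubectl exec -n consul consul-server-0 -- \
  consul snapshot save /consul/data/backup.snap
```
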
drewby08 commented 3 years ago

I'm seeing very similar behavior while running the snapshot in agent mode under Consul Enterprise 1.9.3

2021-06-16T15:12:05.528-0400 [INFO]  snapshot.license: Automated Backups feature is now licensed
2021-06-16T15:12:05.532-0400 [INFO]  license: Consul license updated
2021-06-16T15:12:05.554-0400 [INFO]  snapshot: Waiting to obtain leadership...
2021-06-16T15:12:05.570-0400 [INFO]  snapshot: Obtained leadership
2021-06-16T15:12:11.003-0400 [ERROR] snapshot: Snapshot failed (will retry at next interval): error="snapshot verification failed: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF"
2021-06-16T16:06:59.728-0400 [WARN]  snapshot: Lost leadership, stopping snapshots
2021-06-16T16:06:59.749-0400 [INFO]  snapshot: Waiting to obtain leadership...
2021-06-16T16:06:59.750-0400 [ERROR] snapshot: Unable to use leader lock that was held previously and presumed lost, giving up the lock (will retry): error="Lock already held"
2021-06-16T16:06:59.752-0400 [ERROR] snapshot: Error trying to fetch lock key: error="Unexpected response code: 500"
shomeprasanjit commented 2 years ago

We are running Consul 1.8.0 as pods via the Helm chart deployment, and while performing snapshot save and restore we got the same error:

Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF

Is there a fix for this yet?

Also, for the workaround, can anyone verify whether taking the snapshot from the leader solves the problem?
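
Something like this is what I have in mind for targeting the leader (finding it via the raft peer list; port 8500 and the output filename are assumptions):

```
# Find the current leader, then point the snapshot command directly at it.
# <leader-ip> must be filled in from the list-peers output, and port 8500
# (the default HTTP API port) is an assumption.
consul operator raft list-peers
consul snapshot save -http-addr=http://<leader-ip>:8500 backup.snap
```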

@ChipV223 please re-open this issue since this is not fixed.

Amier3 commented 2 years ago

Thanks for re-opening this @ChipV223!

This seems like a strange and persistent issue. I mentioned it to a few of the other engineers, and the recent activity on this being focused on Consul in k8s leads us to believe it could be a PV issue...

Also, @benumbed noted that:

For the failed snapshots, every single one of them has the exact same size, 32871 bytes, a little over a hundred bytes over 32K.

Which is making us wonder if this issue only occurs at this exact size (or at a different, but consistent size) for others. So my questions to @shomeprasanjit, @wsams, and @drewby08 would be:

Are you using consul in containers? and if so how are you all handling PVs?

Are all the failed snapshots you all are seeing the exact same size in bytes?

shomeprasanjit commented 2 years ago

@Amier3: Thanks for following up.

Here are your answers:

Are you using consul in containers? and if so how are you all handling PVs?

Ans) Yes, we are running Consul as pods via the standard Helm chart. We are using gp3 EBS volumes as persistent storage; this is controlled by the StatefulSet.

Are all the failed snapshots you all are seeing the exact same size in bytes?

Ans) Not sure exactly; it was probably in MBs, but definitely not in bytes.
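
If it helps narrow it down, the exact byte size of each saved snapshot can be pulled with something like this (the backup directory is an assumption):

```
# Print the exact size in bytes of each saved snapshot so failed runs can
# be compared against the 32871-byte pattern noted above.
# The backup directory is an assumed example path.
stat -c '%s %n' /backups/*.snap
```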