lomori opened this issue 4 years ago
This happens to me on 1.6.2 on Arm64. It was also happening on a different cluster that's on amd64 before I upgraded that cluster to 1.6.2, now all is quiet on that front.
Errors from the logs of the member which requested the snapshot:
2019/12/07 00:00:02 [ERR] http: Request GET /v1/snapshot, error: failed to stream snapshot: read tcp 172.xx.xx.xx:57649->172.xx.xx.xx:8300: read: connection reset by peer from=127.0.0.1:42446
The cluster leader's logs look normal:
2019/12/07 00:00:02 [INFO] consul.fsm: snapshot created in 182.885µs
2019/12/07 00:00:02 [INFO] raft: Starting snapshot up to 164365
2019/12/07 00:00:02 [INFO] snapshot: Creating new snapshot at /var/lib/consul/data/raft/snapshots/940-164365-1575705602005.tmp
2019/12/07 00:00:02 [INFO] snapshot: reaping snapshot /var/lib/consul/data/raft/snapshots/933-163151-1575698401708
2019/12/07 00:00:02 [INFO] raft: Compacting logs from 153522 to 154125
2019/12/07 00:00:02 [INFO] raft: Snapshot to 164365 complete
Something else that's interesting: every single one of the failed snapshots is the exact same size, 32871 bytes, a little over a hundred bytes past 32K.
Example:
-rw-r----- 1 consul consul 85640 Dec 6 14:00 20191206-1400.snap
-rw-r----- 1 consul consul 32871 Dec 6 16:00 20191206-1600.snap
-rw-r----- 1 consul consul 85635 Dec 6 18:00 20191206-1800.snap
-rw-r----- 1 consul consul 32871 Dec 6 20:00 20191206-2000.snap
-rw-r----- 1 consul consul 32871 Dec 6 21:00 20191206-2100.snap
-rw-r----- 1 consul consul 32871 Dec 7 00:00 20191207-0000.snap
All the files around 85K are good snapshots; the rest are failures.
This does not seem to happen on the cluster leader, just the follower nodes.
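One way to narrow that down is to stream the snapshot directly from the HTTP API on a follower and on the leader and compare what comes back. A minimal sketch, assuming the default HTTP port (8500), an ACL token exported in CONSUL_HTTP_TOKEN, and `<agent-address>` as a placeholder for whichever node you want to test:
```
# Request a snapshot straight from the /v1/snapshot HTTP endpoint and check the size received.
# <agent-address> is a placeholder: try a follower first, then the leader, and compare.
curl -sS -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  -o direct.snap "http://<agent-address>:8500/v1/snapshot"
ls -l direct.snap
```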
Having the same issue on 1.7.1... here's the output of the same command run three times consecutively:
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Error saving snapshot: Unexpected response code: 500 (failed to decode response: read tcp 192.168.0.1:60444->192.168.0.2:8300: read: connection reset by peer)
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Saved and verified snapshot to index 23436477
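For the runs that do succeed, the saved file can be double-checked offline with the standard `consul snapshot inspect` subcommand, for example:
```
# Print the metadata (ID, size, index, term) of a snapshot that saved successfully.
consul snapshot inspect consul.snapshot
```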
It also happens on Consul v1.6.4
Hi all!
Just wanted to check whether anyone has seen this issue when trying to create snapshots since the release of v1.7.3, which includes the following pull request that addresses this error.
Hi all!
Closing this, as I just tried to repro this behavior with v1.9.4 by running a watch command that saved a Consul snapshot every 3 seconds for 20 minutes, and couldn't reproduce it. But do feel free to re-open if you see this behavior in any of the more recent Consul releases.
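For anyone who wants to run a similar soak test, a rough sketch of that kind of loop (the filename and log file here are illustrative, not the exact command used above; 400 iterations at 3 seconds is roughly 20 minutes):
```
# Save a snapshot every 3 seconds for ~20 minutes and record any failed runs.
for i in $(seq 1 400); do
  consul snapshot save "repro-$i.snap" || echo "iteration $i failed at $(date)" >> failures.log
  sleep 3
done
```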
For what it's worth, this is occurring in k8s 1.19 with the consul-helm chart, on Consul 1.9.3:
[ERROR] agent.http: Request error: method=GET url=/v1/snapshot from=10.200.90.8:51322 error="failed to stream snapshot: read tcp 10.200.87.97:48112->10.200.54.76:8300: read: connection reset by peer"
I'm seeing very similar behavior while running the snapshot in agent mode under Consul Enterprise 1.9.3
2021-06-16T15:12:05.528-0400 [INFO] snapshot.license: Automated Backups feature is now licensed
2021-06-16T15:12:05.532-0400 [INFO] license: Consul license updated
2021-06-16T15:12:05.554-0400 [INFO] snapshot: Waiting to obtain leadership...
2021-06-16T15:12:05.570-0400 [INFO] snapshot: Obtained leadership
2021-06-16T15:12:11.003-0400 [ERROR] snapshot: Snapshot failed (will retry at next interval): error="snapshot verification failed: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF"
2021-06-16T16:06:59.728-0400 [WARN] snapshot: Lost leadership, stopping snapshots
2021-06-16T16:06:59.749-0400 [INFO] snapshot: Waiting to obtain leadership...
2021-06-16T16:06:59.750-0400 [ERROR] snapshot: Unable to use leader lock that was held previously and presumed lost, giving up the lock (will retry): error="Lock already held"
2021-06-16T16:06:59.752-0400 [ERROR] snapshot: Error trying to fetch lock key: error="Unexpected response code: 500"
We are running Consul 1.8.0 as pods via the Helm chart deployment, and while performing a snapshot save and restore we got the same error:
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
Is there a fix for this yet?
Also, regarding the suggested workaround: can anyone verify whether taking the snapshot from the leader avoids the problem?
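For anyone who wants to test that workaround, one possible way to do it (a sketch; `<leader-address>` is a placeholder, and this assumes the leader's HTTP API on port 8500 is reachable from wherever you run the CLI):
```
# Identify the current leader, then point the CLI's HTTP client directly at it.
consul operator raft list-peers
consul snapshot save -http-addr=http://<leader-address>:8500 leader-backup.snap
```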
@ChipV223 please re-open this issue since this is not fixed.
Thanks for re-opening this @ChipV223!
This seems like a strange and persistent issue. I mentioned it to a few of the other engineers, and the recent activity being focused on Consul in k8s leads us to believe it could be a PV issue...
Also, @benumbed noted that:
For the failed snapshots, every single one of them has the same exact size 32871, a little over a hundred bytes over 32K.
Which makes us wonder whether this issue only occurs at this exact size (or at a different but consistent size) for others. So my questions to @shomeprasanjit, @wsams, and @drewby08 would be:
@Amier3: Thanks for following up.
Here are your answers:
Are you using Consul in containers? And if so, how are you all handling PVs?
Ans) Yes, we are running Consul as pods via the standard Helm chart, using gp3 EBS volumes as persistent storage controlled by a StatefulSet.
Are all the failed snapshots you all are seeing the exact same size in bytes?
Ans) Not sure exactly; they were probably in the MB range, but definitely not just bytes.
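If it helps to pin that down, a quick way to see whether the failed backups cluster at one size (a sketch, assuming the snapshot files sit together in one directory):
```
# Count how many snapshot files share each byte size; a repeated size across
# failures would match the 32871-byte pattern reported earlier in this thread.
ls -l *.snap | awk '{print $5}' | sort -n | uniq -c
```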
Overview of the Issue
I keep getting intermittent errors when creating backups of our Consul cluster.
$ consul snapshot save /consul/data/20191206120000-1.snap
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
Reproduction Steps
Keep creating snapshots and you will see that it sometimes fails:
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068184
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068185
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068186
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068186
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068194
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068195
Consul info for both Client and Server
$ consul version
Consul v1.6.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
Client info
```
output from client 'consul info' command here
```
Server info
```
agent:
    check_monitors = 0
    check_ttls = 3
    checks = 3
    services = 3
build:
    prerelease =
    revision = 9be6dfc3
    version = 1.6.1
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.129.8.88:8300
    server = true
raft:
    applied_index = 1068318
    commit_index = 1068318
    fsm_pending = 0
    last_contact = 10.52169ms
    last_log_index = 1068318
    last_log_term = 332
    last_snapshot_index = 1061622
    last_snapshot_term = 332
    latest_configuration = [{Suffrage:Voter ID:faea4c4e-2b87-83af-b831-9b3fe9b34c58 Address:10.131.12.242:8300} {Suffrage:Voter ID:a1b1991e-8f46-45b6-a09f-3c3cd1155bba Address:10.130.13.199:8300} {Suffrage:Voter ID:3261cc81-307f-c0be-3e7a-e224704bf2a3 Address:10.129.8.88:8300} {Suffrage:Voter ID:a12d5052-dc73-b1af-ed86-313e6c0d038c Address:10.130.6.140:8300} {Suffrage:Voter ID:14e82d1e-a6f2-eba8-d2ff-cc648a5daaac Address:10.129.12.38:8300}]
    latest_configuration_index = 1047983
    num_peers = 4
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 332
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 92
    max_procs = 4
    os = linux
    version = go1.12.1
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 19
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 137
    members = 5
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 193
    members = 5
    query_queue = 0
    query_time = 1
```
Operating system and Environment details
OS, Architecture, and any other information you can provide about the environment.
Log Fragments
Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use -log-level=TRACE on the client and server to capture the maximum log detail.
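For example, trace-level output can be streamed from an already-running agent with the standard `consul monitor` command, without restarting anything:
```
# Stream logs from the local agent at trace verbosity.
consul monitor -log-level=trace
```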