hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul Snapshot fails with error "failed to open snapshot" #6669

Closed guessmyname closed 4 years ago

guessmyname commented 4 years ago

Overview of the Issue

We inadvertently joined a new cluster to an existing cluster, which caused the existing cluster to fail. We were able to recover using a peers.json file. After recovery, we are no longer able to take snapshots of the existing cluster. Snapshot attempts fail with the following error: `snapshot: Failed to get meta data to open snapshot: open /etc/consul.d/raft/snapshots/6-69538013-1571760497270/meta.json: no such file or directory`
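For context, the failing request can be issued either through the CLI or through the raw snapshot HTTP API (the log fragment below shows the agent serving `GET /v1/snapshot`). A quick sketch, assuming the agent's HTTP API listens on the default 127.0.0.1:8500:

```
# CLI form (the command that fails after recovery):
consul snapshot save backup.snap

# Equivalent raw HTTP API call; the log fragment below shows this
# endpoint returning "failed to open snapshot":
curl --silent --fail http://127.0.0.1:8500/v1/snapshot -o backup.snap
```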

Reproduction Steps

1. Point a client that is using an existing cluster at a new cluster with the same datacenter name. The new cluster becomes aware of the old cluster and attempts to connect, which brings the old cluster down.
2. Use peers.json to recover the old cluster (a sketch of this step follows the list).
3. Once recovered, attempt a snapshot with `consul snapshot save backup.snap`.
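For step 2, a minimal sketch of the documented peers.json outage-recovery flow, assuming the `/etc/consul.d` data directory seen in the logs and a systemd-managed agent; the node IDs and addresses are copied from `raft.latest_configuration` in the `consul info` output below, using the Raft protocol version 3 format:

```
# Run on every surviving server while all agents in the cluster are
# stopped (systemctl is an assumption about how the agent is managed).
systemctl stop consul

# For Raft protocol 3, peers.json lists every server's node ID and
# address; these values come from raft.latest_configuration below.
cat > /etc/consul.d/raft/peers.json <<'EOF'
[
  { "id": "a6ad78c2-79d4-4472-242a-fe01382ca52c", "address": "10.19.88.163:8300", "non_voter": false },
  { "id": "05b8c3a7-5fa3-16f8-688e-986cd1e36266", "address": "10.19.41.179:8300", "non_voter": false },
  { "id": "558d6953-3104-1122-35d5-021526a2cea1", "address": "10.19.0.168:8300",  "non_voter": false },
  { "id": "05d20dfe-8454-6513-de2c-1279bcfc6f7b", "address": "10.19.42.133:8300", "non_voter": false },
  { "id": "2ebe1869-8c13-9875-a771-de3b02de7c90", "address": "10.19.66.1:8300",   "non_voter": false }
]
EOF

systemctl start consul
```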

Consul info for both Client and Server

Client info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 9a494b5f
    version = 1.0.6
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.19.0.168:8300
    server = true
raft:
    applied_index = 69559359
    commit_index = 69559359
    fsm_pending = 0
    last_contact = 58.379511ms
    last_log_index = 69559359
    last_log_term = 6
    last_snapshot_index = 69554429
    last_snapshot_term = 6
    latest_configuration = [{Suffrage:Voter ID:a6ad78c2-79d4-4472-242a-fe01382ca52c Address:10.19.88.163:8300} {Suffrage:Voter ID:05b8c3a7-5fa3-16f8-688e-986cd1e36266 Address:10.19.41.179:8300} {Suffrage:Voter ID:558d6953-3104-1122-35d5-021526a2cea1 Address:10.19.0.168:8300} {Suffrage:Voter ID:05d20dfe-8454-6513-de2c-1279bcfc6f7b Address:10.19.42.133:8300} {Suffrage:Voter ID:2ebe1869-8c13-9875-a771-de3b02de7c90 Address:10.19.66.1:8300}]
    latest_configuration_index = 68927085
    num_peers = 4
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 6
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 946
    max_procs = 2
    os = linux
    version = go1.9.3
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 293
    failed = 258
    health_score = 3
    intent_queue = 0
    left = 96
    member_time = 2263245
    members = 565
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 787
    members = 5
    query_queue = 0
    query_time = 1
```
Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 9a494b5f
    version = 1.0.6
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.19.0.168:8300
    server = true
raft:
    applied_index = 69559142
    commit_index = 69559142
    fsm_pending = 0
    last_contact = 49.017142ms
    last_log_index = 69559143
    last_log_term = 6
    last_snapshot_index = 69554429
    last_snapshot_term = 6
    latest_configuration = [{Suffrage:Voter ID:a6ad78c2-79d4-4472-242a-fe01382ca52c Address:10.19.88.163:8300} {Suffrage:Voter ID:05b8c3a7-5fa3-16f8-688e-986cd1e36266 Address:10.19.41.179:8300} {Suffrage:Voter ID:558d6953-3104-1122-35d5-021526a2cea1 Address:10.19.0.168:8300} {Suffrage:Voter ID:05d20dfe-8454-6513-de2c-1279bcfc6f7b Address:10.19.42.133:8300} {Suffrage:Voter ID:2ebe1869-8c13-9875-a771-de3b02de7c90 Address:10.19.66.1:8300}]
    latest_configuration_index = 68927085
    num_peers = 4
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 6
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 908
    max_procs = 2
    os = linux
    version = go1.9.3
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 293
    failed = 266
    health_score = 0
    intent_queue = 0
    left = 115
    member_time = 2263229
    members = 565
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 787
    members = 5
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

5-node cluster running Red Hat Enterprise Linux Server release 7.2 (Maipo)

Log Fragments

```
2019/10/22 12:08:17 [INFO] consul.fsm: snapshot created in 24.758µs
2019/10/22 12:08:17 [INFO] raft: Starting snapshot up to 69538013
2019/10/22 12:08:17 [INFO] snapshot: Creating new snapshot at /etc/consul.d/raft/snapshots/6-69538013-1571760497270.tmp
2019/10/22 12:08:17 [INFO] snapshot: reaping snapshot /etc/consul.d/raft/snapshots/6-69538013-1571760497270
2019/10/22 12:08:17 [INFO] raft: Compacting logs from 69526779 to 69527774
2019/10/22 12:08:17 [INFO] raft: Snapshot to 69538013 complete
2019/10/22 12:08:17 [ERR] snapshot: Failed to get meta data to open snapshot: open /etc/consul.d/raft/snapshots/6-69538013-1571760497270/meta.json: no such file or directory
2019/10/22 12:08:17 [ERR] http: Request GET /v1/snapshot?stale=, error: failed to open snapshot: open /etc/consul.d/raft/snapshots/6-69538013-1571760497270/meta.json: no such file or directory: from=10.19.21.14:56174
```
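Note the ordering in the fragment: the store creates `6-69538013-1571760497270` and then logs reaping a snapshot of the same name, so `meta.json` appears to already be gone by the time the HTTP handler tries to open it. A small diagnostic sketch, using the paths taken from the log above, to flag snapshot directories whose metadata is missing:

```
# List the raft snapshot store and flag any snapshot directory that no
# longer contains its meta.json (paths taken from the log above).
ls -l /etc/consul.d/raft/snapshots/
for dir in /etc/consul.d/raft/snapshots/*/; do
    [ -f "${dir}meta.json" ] || echo "missing meta.json in ${dir}"
done
```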

schristoff commented 4 years ago

Howdy @guessmyname, thank you so much for bringing this to our attention. I have a few questions to solidify my understanding of the issue.

Is the problem/bug you're focused on mostly that, when you move a client to another datacenter with the same datacenter name, it shares information about the old cluster, or rather that you cannot perform a snapshot after recovering?

If it is focused on the snapshot aspect: is the peers.json you are restoring the client with from the original datacenter or the new datacenter?

Any further reproduction steps, logs, or links to GitHub repositories are appreciated. :)

stale[bot] commented 4 years ago

Hey there,

This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1:

ghost commented 4 years ago

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one :+1:.