hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Restoring snapshots suddenly stopped working #4738

Closed far-blue closed 5 years ago

far-blue commented 5 years ago

Overview of the Issue

We've been testing a Consul / Vault setup for use in the company and have set up hourly snapshots for backup. Testing these snapshots, we found that between the 9am and 10am snapshots on 22nd Sept, snapshots across all three nodes in the cluster suddenly started failing to restore.

The particular error is: Error restoring snapshot: Unexpected response code: 500 (failed to read snapshot file: failed checking integrity of snapshot: hash check failed for "meta.json")

Looking at the source code, this suggests that the snapshot file was correctly inflated and unpacked and that the files exist, but that the calculated SHA-256 of the meta.json file doesn't match the hash in the SHA256SUMS file.

However, if we manually unpack the snapshot and run the sha256sum CLI tool, the checks pass, and a new hash generated with the CLI tool matches the content of the SHA256SUMS file.
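
For anyone wanting to repeat the manual check across many hourly snapshots, here is a minimal Python sketch of the same comparison. It assumes the snapshot is a gzip-compressed tar archive and that SHA256SUMS uses the usual sha256sum line layout (hex digest, whitespace, file name), which is what the manual check described above implies. The function names are made up for illustration, and this is not Consul's own verification code.

```python
import hashlib
import tarfile


def _clean(name):
    # Drop a leading "*" (sha256sum's binary-mode marker) or "./" (from tar).
    name = name.lstrip("*")
    return name[2:] if name.startswith("./") else name


def verify_snapshot(fileobj):
    """Map each file listed in SHA256SUMS to True if its hash matches."""
    # Assumption: the snapshot is a gzip-compressed tar archive.
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        contents = {
            _clean(member.name): archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }

    results = {}
    for line in contents["SHA256SUMS"].decode().splitlines():
        if not line.strip():
            continue
        # Assumption: each line is "<hex digest>  <file name>".
        expected, name = line.split(maxsplit=1)
        name = _clean(name.strip())
        data = contents.get(name)
        # A file that is listed but missing from the archive counts as a failure.
        results[name] = (
            data is not None
            and hashlib.sha256(data).hexdigest() == expected.lower()
        )
    return results
```

Calling `verify_snapshot(open(path, "rb"))` on a good and a failing snapshot should show whether the archives themselves differ. If both report `True` for every entry, as the manual sha256sum check did, the mismatch is more likely in how the restore path reads the archive than in the stored hashes.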

Reproduction Steps

Just try to restore the snapshot as usual and it will fail (but a snapshot taken an hour earlier restores successfully).

Consul info for both Client and Server

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = 48d287ef
    version = 1.2.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 192.168.x.x:8300
    server = true
raft:
    applied_index = 100612
    commit_index = 100612
    fsm_pending = 0
    last_contact = 25.211601ms
    last_log_index = 100612
    last_log_term = 2
    last_snapshot_index = 100427
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:21e2f522-ca4b-7683-243b-0c967c3b654d Address:192.168.x.x:8300} {Suffrage:Voter ID:fd4fbfa5-b1e0-0bf6-a4fa-46fb60f06185 Address:192.168.x.y:8300} {Suffrage:Voter ID:f516f0e5-9b90-5419-391e-75c9adf94a55 Address:192.168.x.z:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 2
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 78
    max_procs = 4
    os = linux
    version = go1.10.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

CentOS Linux release 7.5.1804 (Core)

far-blue commented 5 years ago

To be clear, older snapshots still restore without any issues.

pearkes commented 5 years ago

Thanks for reporting this.

Do newer snapshots restore successfully as well? Is it possible the snapshot process, filesystem, etc. was interrupted or modified in some way for that specific snapshot?

pearkes commented 5 years ago

I think this is a duplicate of https://github.com/hashicorp/consul/issues/4452. Please report any further information there, thanks!