hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Error restoring snapshot: Unexpected response code: 500 (unknown CA provider "") #5016

Closed PurrBiscuit closed 5 years ago

PurrBiscuit commented 5 years ago

We recently upgraded Consul from 1.2.1 to 1.2.4 and have started seeing these errors in our consul-snapshot pipeline during a restore.

/tmp # /bin/consul --version
Consul v1.2.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
/tmp # /bin/consul snapshot restore backup-1543442482247.snap
Error restoring snapshot: Unexpected response code: 500 (unknown CA provider "")
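
For a pipeline that wraps the restore, it may help to distinguish this specific failure from other restore errors. The following is only a minimal sketch: `is_unknown_ca_provider_error` and `restore_with_diagnosis` are hypothetical helper names, and the match relies solely on the error text shown above.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when the given restore output contains the
# 'unknown CA provider ""' message shown above, and 1 otherwise.
is_unknown_ca_provider_error() {
    case "$1" in
        *'unknown CA provider ""'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: runs the restore, and on failure reports whether the
# output matches the error from this issue. Requires a reachable Consul agent.
restore_with_diagnosis() {
    snap=$1
    if out=$(consul snapshot restore "$snap" 2>&1); then
        printf '%s\n' "$out"
        return 0
    fi
    printf '%s\n' "$out" >&2
    if is_unknown_ca_provider_error "$out"; then
        echo "restore failed with the unknown CA provider error" >&2
    fi
    return 1
}
```

In a pipeline this would be called as `restore_with_diagnosis backup-1543442482247.snap`, with the exit status deciding whether the job fails.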

Steps to reproduce:

docker run -d --rm --volume /tmp:/tmp --entrypoint /bin/consul articulate/consul-agent:1.2 agent -dev
docker exec -it <container_id> sh
/tmp # /bin/consul snapshot restore backup-1543442482247.snap

/ # consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 739949eb
    version = 1.2.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 127.0.0.1:8300
    server = true
raft:
    applied_index = 10
    commit_index = 10
    fsm_pending = 0
    last_contact = 0
    last_log_index = 10
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:8249423f-33e5-0d89-df9c-03ed2ba6475c Address:127.0.0.1:8300}]
    latest_configuration_index = 1
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 76
    max_procs = 4
    os = linux
    version = go1.10.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

pearkes commented 5 years ago

Thanks for the Docker image for reproduction and for reporting.

PurrBiscuit commented 5 years ago

@pearkes just checking in to see whether anything more has been found out about this, and whether it looks like a bug or something we can work around for our restores. I saw in the linked issue that someone else tried a restore and it failed with the same error we're seeing. We're in a vulnerable spot, since we can't run a restore on our 1.2.4 Consul cluster should we need to.

pearkes commented 5 years ago

@PurrBiscuit this should be fixed by https://github.com/hashicorp/consul/pull/5061. If you want to build that branch and give it a try with your repro that'd be great.

pearkes commented 5 years ago

Note that if you're holding off on upgrading to 1.4.0 you could cherry-pick that commit back to 1.2.3+.

$ git clone https://github.com/hashicorp/consul.git && cd consul  # or `git fetch` in an existing checkout
$ git checkout v1.2.3
$ git cherry-pick 9f7e53f97657f4f190e10245541571a29a57ffec
$ make linux

PurrBiscuit commented 5 years ago

@pearkes is this fix not going to make it into the 1.2 versions of Consul? We could upgrade to 1.4.0 in anticipation of this patch if it'll be limited to 1.4 versions. Any idea which patch version this is targeted for?

banks commented 5 years ago

@PurrBiscuit the fix will be in 1.4.1.

This is a pretty tiny patch, so the cherry-pick steps @pearkes gave are a relatively easy way to test it without doing the full upgrade. We don't have a firm date for 1.4.1 yet, so you may need to build from master anyway once this lands if you want to test it in the next week or so.
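
To decide which agents would need the cherry-picked build, a version check along these lines could be used. This is a sketch under an assumption inferred from this thread rather than an official statement: that the affected range is 1.2.3 up to, but not including, 1.4.1. The function names `version_to_int` and `needs_restore_fix` are hypothetical.

```shell
#!/bin/sh
# Convert "1.2.4" or "v1.2.4" into a comparable integer (1002004).
# Assumes a plain MAJOR.MINOR.PATCH string with no leading zeros and
# each component below 1000.
version_to_int() {
    v=${1#v}
    major=${v%%.*}
    rest=${v#*.}
    minor=${rest%%.*}
    patch=${rest#*.}
    echo $(( major * 1000000 + minor * 1000 + patch ))
}

# Returns 0 when the version is at least 1.2.3 and earlier than 1.4.1.
# ASSUMPTION: this range is inferred from the comments in this thread.
needs_restore_fix() {
    n=$(version_to_int "$1")
    [ "$n" -ge "$(version_to_int 1.2.3)" ] && [ "$n" -lt "$(version_to_int 1.4.1)" ]
}
```

For example, `needs_restore_fix 1.2.4 && echo "build with the cherry-picked commit"` would print the message for the version reported in this issue.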

In general, 1.4.0 was the first release where Connect was not a beta feature, and there are many more stability and performance improvements to come for Connect in 1.4.x, so I'd recommend getting there sooner rather than later if you're using Connect.

PurrBiscuit commented 5 years ago

Thanks a lot to everyone involved for the quick response on this. We'll be upgrading our Consul clusters to 1.4 across the board. It's not terribly urgent for us, but if we do need to restore a cluster, we can follow the steps outlined above to bring up a cluster with working restore capability.

pearkes commented 5 years ago

Fixed in https://github.com/hashicorp/consul/pull/5061! If you see further issues please let us know @PurrBiscuit.