hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

consul downgrade: Failed to load any existing snapshots and panic error when starting the consul agents #11430

Open evilin13 opened 3 years ago

evilin13 commented 3 years ago

Overview of the Issue

When downgrading from Consul version 1.8.15 to 1.4.0, none of the three servers can start. The following log is printed: "Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"

I removed the snapshot directories under the 3 consul data directories and restarted the server agents.

[root@infra-2 someuser]# ls -al /mnt/cinder-consul/raft/
peers.info  raft.db     snapshots/  
[root@infra-2 someuser]# ls -al /mnt/cinder-consul/raft/snapshots/
total 0
drwx------. 2 537 537  6 Oct 26 15:46 .
drwx------. 3 537 537 56 Oct 26 15:29 ..
[root@infra-2 someuser]# 

Now there is a panic error when starting the agents:

{"type":"log","level":"notice","time":"2021-10-26T13:39:42.085377412Z","process":"consul[97]","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"==> Starting Consul agent..."}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090934014Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"panic: log not found"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090946940Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":""}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090950136Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"goroutine 1 [running]:"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090954546Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.NewRaft(0xc00036ea20, 0x2c37e20, 0xc0005b14a0, 0x2c57700, 0xc00038ff80, 0x2c4d760, 0xc00001f4a0, 0x2c3ef60, 0xc00001f600, 0x2c62620, ...)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090958885Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/api.go:491 +0x11a1"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090962394Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/agent\/consul.(*Server).setupRaft(0xc000311600, 0x0, 0x0)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090965670Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/agent\/consul\/server.go:651 +0x54f"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090968886Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/agent\/consul.NewServerLogger(0xc000311340, 0xc0001cfd60, 0xc0003ae2a0, 0x0, 0xc0001cfd60, 0xc000257e80)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090972047Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/agent\/consul\/server.go:390 +0xa95"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090975014Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/agent.(*Agent).Start(0xc000140a00, 0xc000140a00, 0x0)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090977792Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/agent\/agent.go:388 +0x347"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090980666Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/command\/agent.(*cmd).run(0xc000415200, 0xc00004c0d0, 0x9, 0x9, 0x0)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090991872Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/command\/agent\/agent.go:226 +0x4d9"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090996137Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/command\/agent.(*cmd).Run(0xc000415200, 0xc00004c0d0, 0x9, 0x9, 0xc0001a19e0)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.090999192Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/command\/agent\/agent.go:75 +0x4d"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.091002188Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/mitchellh\/cli.(*CLI).Run(0xc000480360, 0xc000480360, 0x80, 0xc0001a1b20)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.091005281Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/mitchellh\/cli\/cli.go:242 +0x207"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.091008396Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"main.realMain(0xc0000b4058)"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.091010896Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/main.go:53 +0x38d"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.091013752Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"main.main()"}}
{"type":"log","level":"info","time":"2021-10-26T13:39:42.091016136Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"8debfecebc3f","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/main.go:20 +0x22"}}
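
The JSON wrapper makes the Go stack trace above hard to read. A minimal sketch for unwrapping it, assuming every line ends with the "log":{"msg":"..."}} shape shown in these logs and that messages contain no escaped double quotes or backslashes; unwrap_msgs is a hypothetical helper name, not a Consul tool:

```shell
# Print only the inner "msg" text of each wrapped log line read from stdin.
# Assumption: each line ends with ...,"log":{"msg":"<text>"}} as in the logs above.
unwrap_msgs() {
  _tab=$(printf '\t')
  # 1) keep only the text between "msg":" and the closing "}}
  # 2) turn the JSON escapes \/ and \t back into / and a real tab
  sed -n 's/^.*"log":{"msg":"\(.*\)"}}$/\1/p' |
    sed -e 's#\\/#/#g' -e "s/\\\\t/${_tab}/g"
}
```

Used as `unwrap_msgs < consul.log` (consul.log being a placeholder file name), this prints the panic message followed by the plain goroutine trace. A JSON-aware tool would be more robust than sed if the messages ever contain quotes.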

As I read (and also verified) here and here, removing the raft.db file fixes the issue, but we cannot afford to lose the data stored there.
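
Because deleting files under the raft directory cannot be undone, one precaution is to copy the directory aside first. A minimal sketch, assuming the agent is stopped while the copy is made; backup_raft_dir and the destination layout are hypothetical, and this is not a procedure confirmed by the Consul maintainers:

```shell
# Copy the raft directory aside before deleting anything from it.
# $1 = raft directory (e.g. /mnt/cinder-consul/raft), $2 = backup parent directory.
# Assumption: the Consul agent is stopped, so raft.db is not being written.
backup_raft_dir() {
  dest="$2/raft-$(date +%Y%m%d%H%M%S)"
  # cp -a keeps ownership, permissions and timestamps of raft.db and snapshots/
  mkdir -p "$2" && cp -a "$1" "$dest" && printf '%s\n' "$dest"
}
```

The function prints the path of the copy, so the original raft.db can be put back if a later step fails.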

I also tried to:

{"type":"log","level":"notice","time":"2021-10-27T08:31:07.278212496Z","process":"consul[96]","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"    2021\/10\/27 10:31:07 [INFO] snapshot: Creating new snapshot at \/consul\/data\/raft\/snapshots\/124-206-1635323467278.tmp"}}
{"type":"log","level":"notice","time":"2021-10-27T08:31:07.284413880Z","process":"consul[96]","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"    2021\/10\/27 10:31:07 [INFO] raft: Copied 69387 bytes to local snapshot"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288699005Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"panic: failed to restore snapshot: failed to restore snapshot 124-206-1635323467278: Unrecognized msg type 30"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288720908Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":""}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288724326Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"goroutine 74 [running]:"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288727502Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*Raft).restoreUserSnapshot(0xc000528b00, 0xc0001dad90, 0x2c22d60, 0xc0002ba058, 0x0, 0x0)"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288731228Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/raft.go:787 +0xd98"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288734495Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*Raft).leaderLoop(0xc000528b00)"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288737614Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/raft.go:574 +0xab3"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288741039Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*Raft).runLeader(0xc000528b00)"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288743964Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/raft.go:420 +0x34b"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288746988Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*Raft).run(0xc000528b00)"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288749831Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/raft.go:140 +0x68"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288753237Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*Raft).run-fm()"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288756250Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/api.go:505 +0x2a"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288759434Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*raftState).goFunc.func1(0xc000528b00, 0xc000039c00)"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288762668Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/state.go:146 +0x53"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288765895Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"created by github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft.(*raftState).goFunc"}}
{"type":"log","level":"info","time":"2021-10-27T08:31:07.288768949Z","process":"unknown","service":"consul","system":"tas01","neid":"123456","container":"4b76c6ef47b4","host":"tas01-infra-1","timezone":"CEST","log":{"msg":"\t\/home\/developer\/rpmbuild\/BUILD\/src\/github.com\/hashicorp\/consul\/vendor\/github.com\/hashicorp\/raft\/state.go:144 +0x66"}}

consul conf file:

{
  "data_dir": "/data",
  "ports": {
    "http": 8500
  },
  "acl_default_policy": "deny",
  "disable_update_check": true,
  "log_level": "INFO",
  "rejoin_after_leave": true,
  "raft_protocol": 3,
  "enable_syslog": false
}

Could you please check this issue and suggest how we could overcome it? Or should we conclude that upgrading/downgrading between those versions is not supported?

Thank you, --Evi

jkirschner-hashicorp commented 3 years ago

Hi @evilin13,

Can you tell me more about the motivation behind the downgrade (and why staying with 1.8.15 isn't an option)?

Downgrades are not specifically supported or tested. Though downgrades can work in some cases, in most cases downgrades won't work because the snapshot from 1.8.15 will contain a Raft log entry that Consul 1.4.0 doesn't understand.

If you know in advance when performing an upgrade that you might want to downgrade, we suggest creating a snapshot before upgrading. Then, if you need to downgrade, you can restore using a snapshot generated from the old/matching version rather than from a newer version (which may fail).
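
The snapshot-before-upgrade suggestion can be sketched with the `consul snapshot` subcommands (save, inspect, restore). The function names, the file naming scheme, and the CONSUL_BIN override are assumptions made for illustration:

```shell
# Name the snapshot after the Consul version that produced it, so a later
# rollback can pick the file that matches the version being restored.
# $1 = Consul version (e.g. 1.4.0), $2 = backup directory
snapshot_name() {
  printf '%s/consul-%s.snap\n' "$2" "$1"
}

# Run on the OLD version, before the upgrade starts.
# CONSUL_BIN is a hypothetical override (useful for dry runs); default is `consul`.
save_pre_upgrade_snapshot() {
  file=$(snapshot_name "$1" "$2")
  "${CONSUL_BIN:-consul}" snapshot save "$file" &&
    "${CONSUL_BIN:-consul}" snapshot inspect "$file"
}

# Run after the servers are back on the old binary.
restore_snapshot_for() {
  "${CONSUL_BIN:-consul}" snapshot restore "$(snapshot_name "$1" "$2")"
}
```

For example, `save_pre_upgrade_snapshot 1.4.0 /backup` before upgrading and `restore_snapshot_for 1.4.0 /backup` after rolling back. With ACLs enabled and a default policy of deny, the commands will likely need a sufficiently privileged token, for instance through the CONSUL_HTTP_TOKEN environment variable.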

evilin13 commented 3 years ago

Hi @jkirschner-hashicorp ,

Thank you very much for the prompt and detailed response! Rollback is one of the supported actions between two releases of the product I work on, and I recently upgraded Consul in the latest release. When the testing team ran the rollback scenario to the previous release, we hit the reported problem. So I'm looking for a safe way to keep Consul working in the rollback scenario even when the Consul version changes between two releases of our product.

I tried to create the snapshots before upgrading, then removed all three raft.db files and downgraded to the old version.

The Consul servers started as expected, but before restoring the snapshots, I queried the Vault server (Consul is used as the Vault server's backend in our case) and all the secrets were there. I'm not very familiar with Consul (yet), so I thought that by removing the raft.db files I would lose the KV store's content. Last week I was removing the entire raft directory, not only the raft.db files, and that's why I was losing all the data. :(

Given that I only need to ensure that the secrets/KV store content is not lost during the rollback and that all 3 Consul servers are operational afterwards, would it be enough to just remove the raft.db files before starting the rollback and skip creating/restoring snapshots altogether?

Thank you!

tibistibi commented 1 year ago

I just want to give this some more attention. It is common practice to test a downgrade before upgrading a production system. I'm also working on upgrading Consul, but before we can go to production we need to check whether a downgrade is possible. I hope there is room to at least create some documentation on how to downgrade, covering what is and is not supported.

vjshar commented 1 year ago

I’ve faced similar questions in the past from some of my customers, and my recommendation has been that, to support a rollback/downgrade, customers should follow these steps:

  1. Take a backup/snapshot of the current version before you start an upgrade.
  2. Restore it once the Consul version is rolled back/downgraded.

It is important to note that a snapshot/backup from a higher version of Consul may not be compatible with a lower version; this depends on underlying changes in Raft storage, etc.
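
The compatibility caveat can be turned into a simple guard that runs before a restore. A minimal sketch, assuming the version that produced each snapshot is recorded somewhere (for example in the file name) and that `sort -V` from GNU coreutils is available; version_le and check_restore are hypothetical helpers:

```shell
# Succeeds when version $1 is less than or equal to version $2.
# Uses version sort so that 1.8.2 orders before 1.8.15.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Refuse to restore a snapshot taken on a newer Consul than the target.
# $1 = version that produced the snapshot, $2 = version being restored into
check_restore() {
  if version_le "$1" "$2"; then
    echo "ok: snapshot from $1 can be tried on $2"
  else
    echo "refuse: snapshot from $1 is newer than $2" >&2
    return 1
  fi
}
```

Here `check_restore 1.8.15 1.4.0` fails, which matches the situation reported in this issue, while `check_restore 1.4.0 1.4.0` passes. Passing the check does not guarantee that a restore will work; it only rules out the newer-into-older case.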