hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Deadlock after a leader election. #11436

Closed etherandrius closed 3 years ago

etherandrius commented 3 years ago

Describe the bug: A post-election task failed on a newly elected leader. This resulted in a cluster-wide outage that did not resolve itself. The faulty leader had to be terminated manually.

To Reproduce: I was not able to reproduce the issue.

Expected behavior: Either the post-election task recovers on its own, or leadership is taken over by another Vault instance.

Environment:

$ vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.6.3
Storage Type           postgresql
Cluster Name           vault-cluster-69fd2ba1
Cluster ID             19039e5d-99da-5d8b-bf4a-d8e9b2c31ead
HA Enabled             true
HA Cluster             https://10.0.1.90:8201
HA Mode                standby
Active Node Address    https://10.0.1.90:8200
$ vault version
Vault v1.6.3 (b540be4b7ec48d0dd7512c8d8df9399d6bf84d76)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:    18.04
Codename:   bionic

$ uname -m
x86_64

Vault server configuration file(s):

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/vault/config/cert.pem"
  tls_key_file  = "/vault/config/key.pem"
  tls_client_ca_file = "/vault/config/ca.pem"
}

cluster_addr = "https://10.0.1.90:8201"
api_addr     = "https://10.0.1.90:8200"

telemetry {
  dogstatsd_addr = "localhost:8125"
}

max_lease_ttl = "87600h"

plugin_directory = "/vault/config/plugins"

Additional context: Logs from the Vault leader:

INFO acquired lock, enabling active operation
INFO post-unseal setup starting
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]

The number of goroutines increased steadily until the instance was terminated.
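The symptom above (a goroutine count that only ever climbs) can be checked mechanically. Below is a minimal Go sketch, assuming the counts are sampled periodically from some source such as Vault's runtime telemetry gauge for goroutines. The function name `steadilyIncreasing`, the window size, the growth threshold, and the sample values are all hypothetical and are not taken from this incident.

```go
package main

import "fmt"

// steadilyIncreasing reports whether the last `window` samples are strictly
// increasing and have grown by at least minGrowth from first to last.
// It returns false when there are too few samples to decide.
func steadilyIncreasing(samples []int, window, minGrowth int) bool {
	if window < 2 || len(samples) < window {
		return false
	}
	recent := samples[len(samples)-window:]
	for i := 1; i < len(recent); i++ {
		if recent[i] <= recent[i-1] {
			return false // a dip or plateau: not a steady climb
		}
	}
	return recent[len(recent)-1]-recent[0] >= minGrowth
}

func main() {
	// Hypothetical goroutine counts, not data from this issue.
	healthy := []int{410, 395, 420, 402, 415}
	leaking := []int{410, 980, 1650, 2400, 3300}

	fmt.Println(steadilyIncreasing(healthy, 5, 1000)) // false
	fmt.Println(steadilyIncreasing(leaking, 5, 1000)) // true
}
```

A check like this only flags the condition; it does not say which goroutines are stuck, so stack traces are still needed to diagnose the cause.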

I was not able to run vault debug; it failed with the error: Error during validation: unable to connect to server: context deadline exceeded

I was not able to curl /v1/sys/pprof/goroutine; the connection hung for 20+ minutes before I canceled it.
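For anyone collecting the same data, a request like this can be bounded with a client-side deadline so that it fails fast instead of hanging. The following is a minimal Go sketch, not a fix: against a node in the state described here it would most likely just return a timeout error. It assumes the address and token come from the `VAULT_ADDR` and `VAULT_TOKEN` environment variables, that the default TLS settings are acceptable (a cluster requiring client certificates or a private CA would need a custom `http.Client`), and that the endpoint passes the `debug=2` query parameter through to Go's pprof handler. The helper names and the 10-second timeout are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// defaultTimeout bounds the whole request, including reading the body.
// The value is an arbitrary choice for this sketch.
const defaultTimeout = 10 * time.Second

// pprofURL builds the goroutine-profile URL from a Vault address.
// debug=2 asks Go's pprof handler for full text stack traces (assumed to be
// passed through by the endpoint).
func pprofURL(addr string) string {
	return strings.TrimRight(addr, "/") + "/v1/sys/pprof/goroutine?debug=2"
}

// fetchGoroutineDump requests the goroutine profile and gives up once
// timeout has elapsed, returning an error instead of hanging.
func fetchGoroutineDump(addr, token string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, pprofURL(addr), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Vault-Token", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // a hung server surfaces here as a deadline error
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	dump, err := fetchGoroutineDump(os.Getenv("VAULT_ADDR"), os.Getenv("VAULT_TOKEN"), defaultTimeout)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not fetch goroutine dump:", err)
		os.Exit(1)
	}
	os.Stdout.Write(dump)
}
```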

The issue seems similar to https://github.com/hashicorp/vault/issues/11276 and https://github.com/hashicorp/vault/pull/10456. However, we are running 1.6.3 and the bug was supposed to be fixed in 1.6.1.

So far we've only observed this once.

ncabatoff commented 3 years ago

Hi @etherandrius,

Thanks for reporting this. It's hard to say what happened without stack traces. If it happens again could you send a SIGUSR2 to the active node and find the stack traces in the logs?
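To illustrate why a signal can still produce stack traces when the HTTP API is unresponsive, here is a minimal Go sketch of the general mechanism: a dedicated goroutine waits for SIGUSR2 and writes every goroutine's stack to stderr. This is a generic illustration and not Vault's actual implementation; the function name `goroutineDump` is hypothetical, and `syscall.SIGUSR2` is only available on Unix-like systems.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// goroutineDump returns the stack traces of all goroutines as text.
func goroutineDump() string {
	var buf bytes.Buffer
	// Debug level 2 prints each goroutine's stack in the same format that
	// an unrecovered panic uses.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR2)

	fmt.Printf("pid %d: send SIGUSR2 to dump goroutine stacks\n", os.Getpid())

	// Runs until the process is killed; each SIGUSR2 triggers one dump.
	for range sigs {
		fmt.Fprint(os.Stderr, goroutineDump())
	}
}
```

With the sketch running, `kill -USR2 <pid>` from another shell on the same host triggers a dump. The signal path does not go through the request-handling code, which is why it can work when HTTP requests hang.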

etherandrius commented 3 years ago

@ncabatoff

I will.

As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?

ncabatoff commented 3 years ago

As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?

There's no reason to assume vault debug is running on the same host as the vault server; indeed it's typically the opposite in my experience.

vishalnayak commented 3 years ago

Closing the issue due to staleness.

Closing stale issues helps us keep the issue count down and the project healthy. Keeping the issue count under a manageable number helps us provide faster responses and better engagement with the community.

If you feel that the issue is still relevant, or if it is wrongly closed, please leave a comment and we'd be happy to reopen it.

s3than commented 3 years ago

@vishalnayak I'd like to report that we've hit an exact replica of this error while running 1.7.2.

s3than commented 3 years ago

To resolve it, we had to manually remove the instance whose goroutine count was increasing, which cleared the deadlock.

ncabatoff commented 3 years ago

@s3than we're definitely interested in hearing more. I suggest you open a new bug and provide us with whatever details you have.