Closed (etherandrius closed 3 years ago)
Hi @etherandrius,
Thanks for reporting this. It's hard to say what happened without stack traces. If it happens again could you send a SIGUSR2 to the active node and find the stack traces in the logs?
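For anyone who wants to script the signal step, here is a minimal sketch in Python. It assumes you are on the same Unix host as the Vault server and already know its PID; the helper name `request_stack_dump` is hypothetical and not part of any Vault tooling. It is the equivalent of running `kill -USR2 <pid>` by hand.

```python
import os
import signal


def request_stack_dump(pid: int) -> None:
    """Send SIGUSR2 to the process with the given PID (Unix only).

    Assumption: the caller has permission to signal that process,
    i.e. runs as the same user or as root.
    """
    os.kill(pid, signal.SIGUSR2)


# Example (hypothetical PID): request_stack_dump(12345)
```

After the signal is delivered, the stack traces should show up in the server logs, as described above.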
@ncabatoff
I will.
As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?
There's no reason to assume vault debug is running on the same host as the vault server; indeed it's typically the opposite in my experience.
Closing the issue due to staleness.
Closing stale issues helps us keep the issue count down and the project healthy. Keeping the issue count under a manageable number helps us provide faster responses and better engagement with the community.
If you feel that the issue is still relevant, or if it is wrongly closed, please leave a comment and we'd be happy to reopen it.
@vishalnayak I'd like to state that we've had a direct replica of this error, running 1.7.2.
To resolve it, we had to manually remove the instance whose goroutine count kept increasing, which then resolved the deadlock.
@s3than we're definitely interested in hearing more. I suggest you open a new bug and provide us with whatever details you have.
Describe the bug: A post-election task failed on a newly elected leader. This resulted in a cluster-wide outage, which did not self-resolve. The faulty leader had to be terminated manually.
To Reproduce: I was not able to reproduce the issue.
Expected behavior: Either the post-election task recovers on its own, or leadership is taken over by another Vault instance.
Environment:
Vault server configuration file(s):
Additional context: Logs from vault leader
Goroutines were steadily increasing until the instance was terminated.
I was not able to run vault debug; it failed with the error: Error during validation: unable to connect to server: context deadline exceeded
I was not able to curl /v1/sys/pprof/goroutine; the connection hung for 20+ minutes before I canceled it.
The issue seems similar to https://github.com/hashicorp/vault/issues/11276 and https://github.com/hashicorp/vault/pull/10456. However, we are running 1.6.3 and the bug was supposed to be fixed in 1.6.1.
So far we've only observed this once.
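Regarding the request to /v1/sys/pprof/goroutine that hung: a rough sketch of the same request with an explicit timeout, so that a stuck server fails fast instead of hanging for minutes. The address, token value, and function names here are assumptions for illustration only; the only parts taken from this report are the endpoint path and the fact that Vault reads the token from the `X-Vault-Token` header.

```python
import urllib.request


def build_goroutine_request(addr: str, token: str) -> urllib.request.Request:
    """Build a GET request for the goroutine pprof endpoint."""
    url = addr.rstrip("/") + "/v1/sys/pprof/goroutine"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def fetch_goroutine_dump(addr: str, token: str, timeout: float = 10.0) -> bytes:
    """Fetch the goroutine profile, raising if the socket stalls.

    Note: `timeout` bounds each blocking socket operation (connect, read),
    not the total duration of the request.
    """
    req = build_goroutine_request(addr, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


# Example (hypothetical address and token):
# dump = fetch_goroutine_dump("http://127.0.0.1:8200", "my-token", timeout=5)
```

If the server is deadlocked in the way described here, this would most likely still fail, but with a timeout error after a few seconds rather than an indefinite hang.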