Open DrMagPie opened 3 years ago
I am not sure if this qualifies as a reason for marking the server as unhealthy. I'll take this up internally to discuss with the team and get back to you if I have an update.
We've had some internal discussions and agree that this is a worthwhile and doable enhancement. There are still some details that would need to be ironed out, such as:
Hi @ncabatoff
I don't really know how the health check works internally, but I think the easiest solution to this issue would be to simply audit-log /v1/sys/health requests. That way Vault would respond to the health check the same way it responds to all other requests.
As for flapping, I'm not sure that needs to be addressed separately: a blocked audit device impacts all requests in the same way, so the health check would be handled like any other request. As an additional bonus, this would surface underlying issues such as lack of storage or poorly configured log rotation.
I assume something is missing here https://github.com/hashicorp/vault/blob/main/http/sys_health.go#L73: the health check gives misleading information about cluster health (500 vs. 200) when an audit device is blocked. Maybe a call to something like generateAuditTestProbe would help.
$ curl -i -H "X-Vault-Token: $(cat ~/.vault-token)" $VAULT_ADDR/v1/sys/mounts
HTTP/1.1 500 Internal Server Error
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:28 GMT
Content-Length: 30
{"errors":["internal error"]}
$ curl -i $VAULT_ADDR/v1/sys/health?standbyok=true
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:38 GMT
Content-Length: 294
{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1625746718,"version":"1.7.3","cluster_name":"vault-cluster-a9f6d609","cluster_id":"1268406e-9df0-d74b-43a0-a2c8354afbe6"}
any reason not to make these two consistent?
But I think the easiest solution to this issue would be to simply audit-log /v1/sys/health requests.
That would certainly be an easy solution. Two problems I have with that approach: first, some systems may be configured to hit this endpoint quite often, and if we turn on auditing we may fill up the disks of those clusters. Picture a system that normally sees 1 audited request per minute, but whose status endpoint is hit 4 times per minute, and that suddenly needs 5 times the audit storage it did before. Second problem: the purpose of audit records is primarily to track authenticated requests, because unauthenticated requests generally don't matter from a secrets perspective. Users may rightly object to an audit log filled with irrelevant status requests that they don't care about.
Since we don't like the idea of doing "probes" (fake audit records) to discover the state of the audit device, here's what we're left with in terms of a plan:
I don't know if we're going to work on this anytime soon. Most users who've been bitten by audit-related failures in the past now use two different audit devices to protect against this, since Vault will allow requests to proceed provided at least one audit device succeeds. We're willing to accept a PR for the above plan, but since this is a critical part of Vault request handling we'll be very picky about implementation details.
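For reference, the two-audit-device mitigation mentioned above looks like this (the file paths here are illustrative):

```shell
# Primary file audit device
vault audit enable file file_path=/var/log/vault/audit.log

# Second file audit device at a distinct mount path; Vault keeps serving
# requests as long as at least one enabled device can write the record
vault audit enable -path=file2 file file_path=/var/log/vault/audit2.log

# Confirm both devices are enabled
vault audit list
```

Putting the two files on separate filesystems avoids the failure mode in this issue, where a single full disk blocks every device at once.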
We ran into the exact same issue;
I realise that relating the failed audit device(s) to the health check is difficult, but in my opinion Vault is broken at that point, and the only thing that still works is the health endpoint. A bit ironic, even. ;-)
We also ran into this issue just last week. Disk got full on the active node in a cluster. It really would have been nice for the health checks to fail when this happens.
Describe the bug Due to some misconfiguration on our side, logrotate created the audit log file with the wrong permissions. After that, Vault did not serve any requests because it could not write to the audit log (intended behavior), but the health check reported that Vault was healthy.
I believe that if Vault cannot respond to requests, it should be marked as unhealthy.
To Reproduce Steps to reproduce the behavior:
touch /tmp/vault-audit.log
sudo chmod 0640 /tmp/vault-audit.log
vault audit enable file file_path=/tmp/vault-audit.log
vault status
vault secrets list
sudo chown root:root /tmp/vault-audit.log
kill -1 $VAULT_PID
vault status
vault secrets list
URL: GET http://127.0.0.1:8200/v1/sys/mounts Code: 500. Errors:
backend log
Expected behavior Some indication that the server is unable to serve requests, and an HTTP status code in the 5XX range when accessing the health endpoint.