Open DrMagPie opened 3 years ago
I am not sure if this qualifies as a reason for marking the server as unhealthy. I'll take this up internally to discuss with the team and get back to you if I have an update.
We've had some internal discussions and agree that this is a worthwhile and doable enhancement. There are still some details that would need to be ironed out, such as:
Hi @ncabatoff
I don't really know how the health check works internally, but I think the easiest solution to this issue would be to simply audit-log /v1/sys/health requests. That way Vault would respond to the health check the same way it responds to all other requests.
As for flapping, I'm not sure that needs to be addressed separately: a blocked audit device impacts all requests in the same way, so the health check would be handled like any other request. As an additional bonus, this would surface underlying issues such as lack of storage or poorly configured log rotation.
I assume something is missing here https://github.com/hashicorp/vault/blob/main/http/sys_health.go#L73: the health check gives misleading information about cluster health (500 vs. 200) when an audit device is blocked. Maybe a call to something like generateAuditTestProbe would help.
$ curl -i -H "X-Vault-Token: $(cat ~/.vault-token)" $VAULT_ADDR/v1/sys/mounts
HTTP/1.1 500 Internal Server Error
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:28 GMT
Content-Length: 30
{"errors":["internal error"]}
$ curl -i $VAULT_ADDR/v1/sys/health?standbyok=true
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:38 GMT
Content-Length: 294
{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1625746718,"version":"1.7.3","cluster_name":"vault-cluster-a9f6d609","cluster_id":"1268406e-9df0-d74b-43a0-a2c8354afbe6"}
any reason not to make these two consistent?
But I think the easiest solution to this issue would be to simply audit-log /v1/sys/health requests.
That would certainly be an easy solution. Two problems I have with that approach: first, some systems may be configured to hit this endpoint quite often, and if we turn on auditing we may fill up the disks of those clusters. Picture a system that normally sees 1 audited request per minute, but whose status endpoint is hit 4 times per minute, and that suddenly needs 5 times the audit storage it did before. Second problem: the purpose of audit records is primarily to track authenticated requests, because unauthenticated requests generally don't matter from a secrets perspective. Users may rightly object to an audit log filled with irrelevant status requests that they don't care about.
Since we don't like the idea of doing "probes" (fake audit records) to discover the state of the audit device, here's what we're left with in terms of a plan:
I don't know if we're going to work on this anytime soon. Most users who've been bitten by audit-related failures in the past now use two different audit devices to protect against this, since Vault will allow requests to proceed provided at least one audit device succeeds. We're willing to accept a PR for the above plan, but since this is a critical part of Vault request handling we'll be very picky about implementation details.
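For reference, the two-audit-device mitigation mentioned above looks like this (the file paths here are illustrative):

```shell
# Primary file audit device
vault audit enable file file_path=/var/log/vault/audit.log

# Second file audit device at a distinct mount path; Vault keeps serving
# requests as long as at least one enabled device can write the record
vault audit enable -path=file2 file file_path=/var/log/vault/audit2.log

# Confirm both devices are enabled
vault audit list
```

Putting the two files on separate filesystems avoids the failure mode in this issue, where a single full disk blocks every device at once.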
We ran into the exact same issue;
I realise that relating the failed audit device(s) to the health check is difficult, but in my opinion Vault is broken at that point, and the only thing that still works is the health endpoint. A bit ironic, even. ;-)
We also ran into this issue just last week. Disk got full on the active node in a cluster. It really would have been nice for the health checks to fail when this happens.
Describe the bug Due to some misconfiguration on our side, logrotate created the audit log file with the wrong permissions. After that, Vault did not serve any requests because it could not write to the audit log (intended behavior), but the health check reported that Vault was healthy.
I believe that if Vault cannot respond to requests, it should be marked as unhealthy.
To Reproduce Steps to reproduce the behavior:
touch /tmp/vault-audit.log
sudo chmod 0640 /tmp/vault-audit.log
vault audit enable file file_path=/tmp/vault-audit.log
vault status
vault secrets list
sudo chown root:root /tmp/vault-audit.log
kill -1 $VAULT_PID
vault status
vault secrets list
URL: GET http://127.0.0.1:8200/v1/sys/mounts Code: 500. Errors:
backend log
Expected behavior Some indication that the server is unable to serve requests, and an HTTP status code in the 5XX range when accessing the health endpoint.