hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Vault considers itself healthy when it can't write to the audit log #11949

Open DrMagPie opened 3 years ago

DrMagPie commented 3 years ago

Describe the bug Due to some misconfiguration on our side, logrotate created the audit log file with the wrong permissions. After that, Vault did not serve any requests because it could not write to the audit log (intended behavior), but the health check still reported that Vault was healthy.

I believe that if Vault cannot respond to requests, it should be marked as unhealthy.

To Reproduce Steps to reproduce the behavior:

  1. touch /tmp/vault-audit.log

  2. sudo chmod 0640 /tmp/vault-audit.log

  3. vault audit enable file file_path=/tmp/vault-audit.log

  4. vault status

    Key             Value
    ---             -----
    Seal Type       shamir
    Initialized     true
    Sealed          false
  5. vault secrets list

    Path          Type         Accessor              Description
    ----          ----         --------              -----------
    cubbyhole/    cubbyhole    cubbyhole_ca631048    per-token private secret storage
    identity/     identity     identity_36ea49cf     identity store
    secret/       kv           kv_2f024240           key/value secret storage
    sys/          system       system_00aaccc1       system endpoints used for control, policy and debugging
  6. sudo chown root:root /tmp/vault-audit.log

  7. kill -1 $VAULT_PID

  8. vault status

    Key             Value
    ---             -----
    Seal Type       shamir
    Initialized     true
    Sealed          false
  9. vault secrets list

    URL: GET http://127.0.0.1:8200/v1/sys/mounts
    Code: 500. Errors:

Backend log:

==> Vault reload triggered
2021-06-28T10:49:28.564Z [ERROR] no config found at reload time
2021-06-28T10:49:28.564Z [INFO]  audit: reloading file audit backend: path=file/
Error(s) were encountered during reload: 1 error occurred:
        * error encountered reloading file audit device at path "file/": open /tmp/vault-audit.log: permission denied

2021-06-28T10:49:48.321Z [ERROR] audit: backend failed to log request: backend=file/ error="open /tmp/vault-audit.log: permission denied"
2021-06-28T10:49:48.321Z [ERROR] core: failed to audit request: path=sys/mounts
  error=
  | 1 error occurred:
  |     * no audit backend succeeded in logging the request
  |

2021-06-28T10:49:48.321Z [ERROR] audit: backend failed to log response: backend=file/ error="open /tmp/vault-audit.log: permission denied"
2021-06-28T10:49:48.321Z [ERROR] core: failed to audit response: request_path=sys/mounts
  error=
  | 1 error occurred:
  |     * no audit backend succeeded in logging the response
  |

Expected behavior Some indication that the server is unable to serve requests, and an HTTP status code in the 5XX range when accessing the health endpoint.

vishalnayak commented 3 years ago

I am not sure if this qualifies as a reason for marking the server as unhealthy. I'll take this up internally to discuss with the team and get back to you if I have an update.

ncabatoff commented 3 years ago

We've had some internal discussions and agree that this is a worthwhile and doable enhancement. There are still some details that would need to be ironed out, such as:

DrMagPie commented 3 years ago

Hi @ncabatoff

I don't really know how the health check works internally, but I think the easiest solution to this issue would be to simply audit-log /v1/sys/health requests. In that case Vault would respond to the health check the same way it responds to all other requests.

As for flapping, I'm not sure it needs to be addressed separately, since it impacts all requests the same way, so it should be handled like any other request. As an additional bonus, it would surface underlying issues such as a lack of storage or poorly configured log rotation.

avoidik commented 3 years ago

I assume something is missing here: https://github.com/hashicorp/vault/blob/main/http/sys_health.go#L73. The health check gives misleading information about cluster health (500 vs. 200) when the audit device is blocked; maybe a call to generateAuditTestProbe is needed.

$ curl -i -H "X-Vault-Token: $(cat ~/.vault-token)" $VAULT_ADDR/v1/sys/mounts
HTTP/1.1 500 Internal Server Error
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:28 GMT
Content-Length: 30

{"errors":["internal error"]}
$ curl -i $VAULT_ADDR/v1/sys/health?standbyok=true
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:38 GMT
Content-Length: 294

{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1625746718,"version":"1.7.3","cluster_name":"vault-cluster-a9f6d609","cluster_id":"1268406e-9df0-d74b-43a0-a2c8354afbe6"}

Any reason not to make these two consistent?
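
For illustration only, here is a minimal, self-contained Go sketch of the general idea (not Vault's actual handler; sysHealthHandler and the auditHealthy flag are assumptions made for this example): the audit write path flips a shared flag on failure, and the health endpoint returns 503 instead of 200 while that flag is set. A real change would live behind Vault's existing sys_health.go handler and keep the seal, init, and standby checks.

package main

import (
    "encoding/json"
    "log"
    "net/http"
    "sync/atomic"
)

// auditHealthy would be flipped to false by the audit write path when a log
// attempt fails, and back to true once writes succeed again (hypothetical flag).
var auditHealthy atomic.Bool

func init() { auditHealthy.Store(true) }

// sysHealthHandler mimics /v1/sys/health: the usual seal/init checks are
// omitted here, but an audit check is added so load balancers see a 5xx status.
func sysHealthHandler(w http.ResponseWriter, r *http.Request) {
    status := http.StatusOK
    if !auditHealthy.Load() {
        // 503 so load balancers stop routing traffic to this node.
        status = http.StatusServiceUnavailable
    }
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "initialized":   true,
        "sealed":        false,
        "audit_healthy": auditHealthy.Load(),
    })
}

func main() {
    http.HandleFunc("/v1/sys/health", sysHealthHandler)
    log.Fatal(http.ListenAndServe("127.0.0.1:8300", nil))
}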

ncabatoff commented 3 years ago

> But I think that the easiest solution to this issue would be to simply audit-log /v1/sys/health requests.

That would certainly be an easy solution. Two problems I have with that approach: first, some systems may be configured to hit this endpoint quite often, and if we turn on auditing we may fill up the disks of those clusters. Picture a system that normally sees 1 audited request per minute, but whose status endpoint is hit 4 times per minute, and that suddenly needs 5 times the audit storage it did before. Second problem: the purpose of audit records is primarily to track authenticated requests, because unauthenticated requests generally don't matter from a secrets perspective. Users may rightly object to an audit log filled with irrelevant status requests that they don't care about.

ncabatoff commented 3 years ago

Since we don't like the idea of doing "probes" (fake audit records) to discover the state of the audit device, here's what we're left with in terms of a plan:

  1. Use a circuit breaker around audit writes: after enough failures have occurred over a given period, make the Vault health endpoint return a non-healthy state. Using a circuit breaker will help reduce flapping due to occasional failures. (A rough sketch of this idea follows the list.)
  2. Periodically, reset the health status of vault nodes that are in an audit-unhealthy state. In other words, we're guessing/hoping that the situation may have been resolved, and we want any load balancer that might be avoiding sending traffic to us to try again. This is the alternative to writing fake audit records to discover when auditing starts working again.
  3. Expose metrics relating to audit failures so that additional insight is available without having to look in logs.
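
For points 1 and 2, here is a minimal sketch of what such a circuit breaker could look like (illustrative only; auditBreaker, its method names, and the thresholds are assumptions, not Vault code):

package main

import (
    "fmt"
    "sync"
    "time"
)

// auditBreaker tracks recent audit-write failures and marks the node
// unhealthy once too many occur; after a window elapses it optimistically
// resets, so a recovered audit device makes the node eligible for traffic
// again without writing fake audit probes.
type auditBreaker struct {
    mu        sync.Mutex
    failures  int
    threshold int           // failures that trip the breaker
    window    time.Duration // how long a tripped state lasts before retrying
    trippedAt time.Time
}

func newAuditBreaker(threshold int, window time.Duration) *auditBreaker {
    return &auditBreaker{threshold: threshold, window: window}
}

// RecordFailure is called whenever an audit device fails to log a request.
func (b *auditBreaker) RecordFailure() {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.failures++
    if b.failures >= b.threshold && b.trippedAt.IsZero() {
        b.trippedAt = time.Now()
        // A real implementation would also bump an audit-failure metric
        // here, covering point 3.
    }
}

// RecordSuccess is called on a successful audit write and clears the breaker.
func (b *auditBreaker) RecordSuccess() {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.failures = 0
    b.trippedAt = time.Time{}
}

// Healthy is what the /v1/sys/health handler would consult. Once the window
// has elapsed, the node reports healthy again (point 2) so that load
// balancers retry it.
func (b *auditBreaker) Healthy() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.trippedAt.IsZero() {
        return true
    }
    if time.Since(b.trippedAt) > b.window {
        b.trippedAt = time.Time{}
        b.failures = 0
        return true
    }
    return false
}

func main() {
    b := newAuditBreaker(3, 30*time.Second)
    b.RecordFailure()
    b.RecordFailure()
    b.RecordFailure()                    // third failure trips the breaker
    fmt.Println("healthy:", b.Healthy()) // false until the window elapses
}

In this sketch the audit broker would call RecordFailure/RecordSuccess around each write and the health handler would call Healthy; the periodic reset inside Healthy is what stands in for the rejected "probe" approach.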

I don't know if we're going to work on this anytime soon. Most users who've been bitten by audit-related failures in the past now use two different audit devices to protect against this, since Vault will allow requests to proceed provided at least one audit device succeeds. We're willing to accept a PR for the above plan, but since this is a critical part of Vault request handling we'll be very picky about implementation details.
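
The two-audit-device mitigation mentioned above can be configured with the CLI (vault audit enable ...) or programmatically; below is a sketch using the official Go API client (github.com/hashicorp/vault/api). It assumes VAULT_ADDR and VAULT_TOKEN are set in the environment, a writable /var/log/vault-audit.log, and a local syslog daemon; the paths and device choices are examples, not recommendations.

package main

import (
    "log"

    "github.com/hashicorp/vault/api"
)

func main() {
    // DefaultConfig reads VAULT_ADDR; NewClient picks up VAULT_TOKEN.
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // First audit device: file. Vault only needs one device to succeed for a
    // request to proceed, so a second device guards against a single broken
    // log file taking the cluster down.
    if err := client.Sys().EnableAuditWithOptions("file", &api.EnableAuditOptions{
        Type:    "file",
        Options: map[string]string{"file_path": "/var/log/vault-audit.log"},
    }); err != nil {
        log.Fatal(err)
    }

    // Second audit device: syslog, which has no file permissions to get wrong.
    if err := client.Sys().EnableAuditWithOptions("syslog", &api.EnableAuditOptions{
        Type: "syslog",
    }); err != nil {
        log.Fatal(err)
    }
}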

robertdebock commented 2 years ago

We ran into the exact same issue.

I realise that relating the failed audit device(s) to the health check is difficult, but in my opinion Vault is broken, and the only thing that still works is the health endpoint. A bit ironic, even. ;-)

This issue is also described here and here.

abarclay commented 11 months ago

We also ran into this issue just last week. The disk filled up on the active node in a cluster. It really would have been nice for the health checks to fail when this happens.