Kubernetes API errors are not logged as errors

rbayerl commented 1 year ago

A recent change to the Kubernetes auth method changed error logs to debug logs. These errors can be raised for a number of reasons:

The Kubernetes API timed out
The Kubernetes API is broken
The CA certificate does not match
There's an issue with the request

Vault does not differentiate between these separate and very different issues and just always returns HTTP 403 Permission Denied. This wasn't a big deal when context could be found in the logs, but now those logs are gone too. This makes debugging issues much harder by requiring the log level to be set to debug constantly to catch these issues. Can we either:

Change the log level back to error (easiest)
Include the error response in what is sent back to the client
Send the appropriate response code to the client (instead of just Permission Denied for any error)
Both 2 and 3 (best)

I'm happy to open a PR and implement the solution, but it would be nice to have an idea of which option the maintainers prefer.

tomhjp commented 1 year ago

Thanks for the issue, I think this is a good one to discuss and improve. The UX concerns you highlight are legitimate, but we also need to balance that against the security requirements, which are a bit tricky because this error and log message gets triggered from an unauthenticated endpoint. I think that probably rules out option 2 altogether, because we want to be especially cautious not to leak any information on unauthenticated endpoints.

Option 3 is kind of in the same bucket, but feels not as bad because we could guarantee that we only tell the client it is misconfigured, but won't include any potentially leaky information in an error string. What specific error code would you suggest?

Option 1 allows unauthenticated clients to spam Vault server logs with errors, which isn't ideal. Option 5, we could also try to do some classification on the error and log it as an error if we're confident it's a misconfiguration, but otherwise log as debug. WDYT?

rbayerl commented 1 year ago

Thanks for the issue, I think this is a good one to discuss and improve. The UX concerns you highlight are legitimate, but we also need to balance that against the security requirements, which are a bit tricky because this error and log message gets triggered from an unauthenticated endpoint. I think that probably rules out option 2 altogether, because we want to be especially cautious not to leak any information on unauthenticated endpoints.

Good call. I didn't really consider the security implications - which is why I raised this for discussion.

Option 3 is kind of in the same bucket, but feels not as bad because we could guarantee that we only tell the client it is misconfigured, but won't include any potentially leaky information in an error string. What specific error code would you suggest?

Response codes are hard when Vault is just acting as a passthrough. Calling it out as a client side error typically means clients won't retry, though. On the other hand calling it a server side error might trigger some team's alarms for monitoring when Vault is malfunctioning...which is not strictly the case here (Vault is fine but a dependency is broken). Maybe 504 Gateway Timeout would be most appropriate?

Option 1 allows unauthenticated clients to spam Vault server logs with errors, which isn't ideal. Option 5, we could also try to do some classification on the error and log it as an error if we're confident it's a misconfiguration, but otherwise log as debug. WDYT?

Option 5 sounds like a great solution. With error classification we could also send different response codes to the client if that's acceptable as well 👍.

rbayerl commented 1 year ago

@tomhjp are we in agreement on option 5? Would you like me to write up a PR or does someone else want to handle this?

roy-work commented 1 year ago

Option 1 allows unauthenticated clients to spam Vault server logs with errors, which isn't ideal.

If this is a serious concern, rate limit the logs.

But having Vault silently swallowing internal errors while processing auth just cost me two days of debugging.

tanishq-dubey commented 4 months ago

Agree with @roy-work , lack of errors being bubbled up makes the logging effectively useless and requires the user to drop to a DEBUG or TRACE level, at which point they are already being spammed with logs.

Related to https://github.com/hashicorp/vault-plugin-auth-kubernetes/issues/184 where the DEBUG log contains the endpoint where the context is cancelled, however that should be an ERROR since it makes vault inoperable.

hashicorp / vault-plugin-auth-kubernetes

Kubernetes API errors are not logged as errors #168