ory / keto

The most scalable and customizable permission server on the market. Fix your slow or broken permission system with Google's proven "Zanzibar" approach. Supports ACL, RBAC, and more. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.
Apache License 2.0

Ready check does not include current database connectivity #831

Open Waidmann opened 2 years ago

Waidmann commented 2 years ago

Preflight checklist

Describe the bug

The health/ready endpoint returns OK even after database connectivity has been lost. I would expect it to check this, because the docs state: "This endpoint returns a 200 status code when the HTTP server is up running and the environment dependencies (e.g. the database) are responsive as well."

Reproducing the bug

  1. Set up a Postgres service in a k8s cluster
  2. Deploy Keto to the cluster with the DSN pointing to the Postgres service
  3. Kill Postgres
  4. Call Keto's health/ready endpoint -> returns OK

However, when I try to insert or query tuples, I am obviously greeted with an error code.

Relevant log output

No response

Relevant configuration

No response

Version

0.6.0-alpha.1

On which operating system are you observing this issue?

No response

In which environment are you deploying?

Kubernetes with Helm

Additional Context

No response

zepatrik commented 2 years ago

Good point, that should really be the case.

zepatrik commented 2 years ago

The ready-checkers are registered here: https://github.com/ory/keto/blob/e9e6385fabeb333b9115cbb21276864e6d561640/internal/driver/registry_default.go#L88
Currently none are registered, which means that Keto appears healthy as soon as it runs.
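
For illustration, a minimal sketch of what a database-backed ready check does, using only database/sql and net/http. The DSN, port, path, and driver choice are placeholders, not Keto's configuration, and in Keto the check would have to be wired into the ready-checker registration linked above rather than into a standalone server:

```go
// Minimal sketch of a readiness endpoint that actually pings the database.
// DSN, port, and driver are placeholders, not Keto's configuration.
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // any database/sql driver works here
)

func main() {
	// sql.Open does not connect yet; connectivity is only verified by the ping below.
	db, err := sql.Open("pgx", "postgres://keto:secret@postgres:5432/keto?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), time.Second)
		defer cancel()
		// Only report ready while the database answers a ping.
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{"status":"ok"}`))
	})

	log.Fatal(http.ListenAndServe(":4466", nil))
}
```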

nickjn92 commented 2 years ago

From a Kubernetes point of view, you don't want to include external dependencies, such as a database, in your readiness checks. Otherwise you might end up in a cascading-failure scenario where all pods are taken down and unable to serve requests, and you are greeted with some generic error that doesn't really explain what is causing the issue. I believe the best practice is to rely on monitoring to determine what is causing the errors, and if you need to wait for the database to be up you can use an initContainer or lifecycle hooks.

zepatrik commented 2 years ago

Interesting standpoint, maybe @Demonsthere can give his opinion on this? Keto is generally not able to serve any request without a working database connection. Init migration jobs will also not complete, so you will end up in an error loop on helm install anyway. But yeah, killing a pod just because the database is unavailable is also not helpful :thinking:

Demonsthere commented 2 years ago

Imho, from a deployment perspective:

zepatrik commented 2 years ago

Sounds good, so basically we would ping the database on startup and report as ready once that succeeded. Further ready checks will not ping the database again, but always return true. Later we can add a check that pings the db periodically.
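
A rough sketch of that behavior, assuming a plain database/sql handle (the type name and the HTTP wiring are illustrative, not Keto's actual ready-checker API): the database is pinged until the first success, and from then on the check always reports ready without touching the database again.

```go
// Sketch of the proposal above: ping until the first success, then always ready.
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"sync/atomic"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // placeholder driver choice
)

type startupReadyChecker struct {
	db        *sql.DB
	succeeded atomic.Bool // set once the first ping went through
}

// Ready returns nil once a single ping has ever succeeded; after that,
// the database is never pinged again.
func (c *startupReadyChecker) Ready(ctx context.Context) error {
	if c.succeeded.Load() {
		return nil
	}
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	if err := c.db.PingContext(ctx); err != nil {
		return err // still starting up, not ready yet
	}
	c.succeeded.Store(true)
	return nil
}

func main() {
	db, err := sql.Open("pgx", "postgres://keto:secret@postgres:5432/keto?sslmode=disable") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	checker := &startupReadyChecker{db: db}

	http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if err := checker.Ready(r.Context()); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":4466", nil))
}
```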

mstrYoda commented 2 years ago

In Kubernetes, we can define a failure threshold, i.e. how many times a probe may fail before the pod is restarted. We can also define initialDelaySeconds to wait for some operational tasks to complete before health/readiness requests are sent.

IMHO, adding a database health check would be good as well.
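
For illustration, those two knobs expressed with the Kubernetes Go API types (path, port, and timings are made-up values, not the Helm chart's defaults; the embedded field is named ProbeHandler on k8s.io/api v0.23+):

```go
// Illustrative readiness-probe settings; all values are placeholders.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/health/ready",
				Port: intstr.FromInt(4466),
			},
		},
		InitialDelaySeconds: 10, // give startup tasks (e.g. migrations) time before the first probe
		PeriodSeconds:       10, // probe interval
		FailureThreshold:    3,  // consecutive failures before the pod is marked unready
	}
	fmt.Printf("readiness probe: %+v\n", probe)
}
```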

Demonsthere commented 2 years ago

In the helm charts the values for probes are exposed and can be configured to your liking :)

Demonsthere commented 2 years ago

Edit: we actually ran into a related issue some time ago 😅 which caused us to rethink the setup a bit. We have now exposed the option to change the probes to custom ones, as seen here in kratos, and will work on reworking the health checks in general.

aeneasr commented 1 year ago

Isn't this solved now? I think one of the probes now checks DB connectivity

zepatrik commented 1 year ago

They would have to be added here, right? https://github.com/ory/keto/blob/9215c0670541b36a279fa682b685aba0381a0ae3/internal/driver/registry_default.go#L122 Maybe that was a different project, and we can transfer the change?

aeneasr commented 1 year ago

:O Yes, definitely, that needs to be checked! Otherwise we could run into an outage if we encounter one of those SQL connection bugs with cockroach that need a pod restart

https://github.com/ory/kratos/blob/4181fbc381b46df5cd79941f20fc885c7a1e1b47/driver/registry_default.go#L255-L273

jonas-jonas commented 1 year ago

Should be possible to more or less copy from Kratos: https://github.com/ory/kratos/blob/master/driver/registry_default.go#L252-L280

aran commented 11 months ago

I just ran into an issue using PostgreSQL as the backend, with calls to Keto reporting something like:

unable to fetch records...terminating connection due to administrator command (SQLSTATE 57P01) with gRPC code Unknown.

DB was up and retries didn't work. However, restarting the pod worked. I am wondering if there's a chance of this issue making it over the finish line?