Include database status on the /health endpoint or a new endpoint like /db-health

sunshineo commented 4 years ago

Is your feature request related to a problem? Please describe. Deployed Parse server using k8 and used /parse/health for the readinessProbe and livenessProbe probe. But when deploy new code and broke the database connection, /parse/health still returns 200 OK. So old container was killed and new broken one brought online

Describe the solution you'd like Only return 200 ok after db is connected on /parse/health or /parse/db-health

Describe alternatives you've considered Tried to create a cloud function that checks db connection but that requires POST and have application-id in the header, which k8 cannot do

Additional context Related to https://github.com/parse-community/parse-server/issues/4575

davimacedo commented 4 years ago

It sounds reasonable for me. Are willed to tackle this one?

mtrezza commented 4 years ago

I think it's a good PR but this is definitely a breaking change and should be denoted clearly in the release notes.

Currently, calling /parse/health is an endpoint that could be in use to tell whether parse server is reachable via a given domain, for connection string failover. So the information derived from the endpoint response is slightly different when it also depends on the DB connection string health.

sunshineo commented 4 years ago

New info I learned: for K8, there is a liveness probe and a readiness probe. The liveness probe should be current /parse/health . It only means if the docker is responsible or not. And if not, K8 will restart the docker. But a readiness probe should include if there is a proper database connection because if that is true, K8 will send traffic to it.

So we should have a separated url like /parse/db-health

mtrezza commented 4 years ago

Good point. Is there a scenario in which the DB is not ready immediately and will become ready after some time? Does parse server support a DB connection that is not ready on start-up?

What do you think about adding a general /parse/ready endpoint? I am thinking about a cache system or other modules that may need to become ready for parse server to function. So all these modules could be summarized in a single ready endpoint, and it's easily extendable.

sunshineo commented 4 years ago

Thank you @mtrezza . I'll try to create a ready endpoint starting with just db and later we can add more. I plan to just do a query on the _User collection with count=0. I think it will work but let me know if there is a better way

mtrezza commented 4 years ago

It depends on how you define “ready” for the DB, which has itself a status endpoint.

For example:

The driver has a isConnected method to determine whether a connection between driver and DB is established.

https://mongodb.github.io/node-mongodb-native/3.5/api/MongoClient.html#isConnected

The server status, which may be a good middle ground for your ready endpoint.

https://mongodb.github.io/node-mongodb-native/3.5/api/Admin.html#serverStatus

The replica set status, where you determine the individual node status. Just because you can execute a query, doesn’t mean the replica set is fully up and running and ready for what you consider your modus operandi. Maybe a separate PR in the future, if someone wants to get into these details.

https://mongodb.github.io/node-mongodb-native/3.5/api/Admin.html#replSetGetStatus

_{Sent with GitHawk}

davimacedo commented 4 years ago

Thank you @mtrezza . I'll try to create a ready endpoint starting with just db and later we can add more. I plan to just do a query on the _User collection with count=0. I think it will work but let me know if there is a better way

I'd do a query in the schema table.

mtrezza commented 4 years ago

I did some research and came across two concepts for these endpoints. Maybe it can give you some inspiration.

1) Accumulative endpoints

/health/live - The application is up and running. /health/ready - The application is ready to serve requests. /health - Accumulating all health check procedures in the application.

https://quarkus.io/guides/microprofile-health

2) Separate endpoints

GET /-/health Only checks whether the application server is running It does not verify the database or other services are running. A successful response will return a 200 status code

GET /-/readiness Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either.

GET /-/liveness Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted.

GET /-/metrics Node process metrics Infrastructure to collect internal metrics (like southbound APIs) API to retrieve the collected metrics and to reset them.

https://www.npmjs.com/package/@ozawa/express-health-check-middleware

davimacedo commented 4 years ago

I believe option 1 is good enough for now but option 2 would be really great. I have no idea how we could ensure all those states in option 2 though.

mtrezza commented 4 years ago

I think we should decide for a long-term option now, because if we want to change this in a few months, we may have a breaking change and things get more complex.

Option 1 takes /health as summary endpoint (everything is running and ready).
Option 2 takes /health as a single-aspect endpoint (parse server is running, but maybe not ready).

What is more common in 3rd party systems to ensure compatibility and practicability for this endpoint?

mman commented 4 years ago

Looking around my k8s cluster, this is what prometheus uses:

    Liveness:     http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120

and this is what ambassador uses:

    Liveness:   http-get http://:8877/ambassador/v0/check_alive delay=30s timeout=1s period=3s #success=1 #failure=3
    Readiness:  http-get http://:8877/ambassador/v0/check_ready delay=30s timeout=1s period=3s #success=1 #failure=3

So IMHO focusing on ready, vs. live, where live probe avoids accidental restarts and ready probe serves as a signal to start routing traffic sounds better then to focus on overall health vs db health.

This is a great resource where I think we should aim to go: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes

mtrezza commented 4 years ago

@mman Great research, makes absolutely sense. If I understand it correctly, it’s option 2, right?

_{Sent with GitHawk}

mman commented 4 years ago

Yes, option 2, but also potentially changing names of the endpoints to better indicate their purpose. /healthy + /ready as prometheus uses sounds logical to me, perhaps keeping the legacy /health around pointing towards /healthy as well for compatibility reasons.

parse-community / parse-server

Include database status on the /health endpoint or a new endpoint like /db-health #6588