thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Provide different liveness and readiness endpoints for Kubernetes deployments #1164

Open jsanda opened 2 years ago

jsanda commented 2 years ago

Reaper provides the /healthcheck endpoint, which k8ssandra uses for both the liveness and readiness probes. The ReaperHealthCheck class provides the implementation for the endpoint. It first tries to connect to Cassandra and then performs some queries.

These checks make sense for a readiness probe. Reaper cannot be considered ready if it cannot connect to Cassandra. These checks do not make sense for a liveness probe. A liveness check should simply return a 200 status code. When a liveness check fails, Kubernetes will restart the container. I should mention that k8s probes are configurable such that there could be multiple failures before a restart. With that said, Reaper should have a separate liveness check endpoint. If Reaper is trying to connect to Cassandra, it is running. It just hasn't reached the ready state.
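
To make the distinction concrete, here is a minimal sketch of the two probe outcomes. The class and method names (ProbeStatus, liveness, readiness) are hypothetical and not part of Reaper; the storage check is passed in as a plain callback standing in for whatever ReaperHealthCheck does against Cassandra.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: maps probe type to the HTTP status a probe endpoint would return. */
public final class ProbeStatus {
    public static final int OK = 200;
    public static final int SERVICE_UNAVAILABLE = 503;

    private ProbeStatus() {}

    /** Liveness: the process is up and can answer, whether or not Cassandra is reachable. */
    public static int liveness() {
        return OK;
    }

    /** Readiness: 200 only when the storage check passes; a failed or throwing check yields 503. */
    public static int readiness(BooleanSupplier storageCheck) {
        try {
            return storageCheck.getAsBoolean() ? OK : SERVICE_UNAVAILABLE;
        } catch (RuntimeException e) {
            // A check that blows up means "not ready", not "dead".
            return SERVICE_UNAVAILABLE;
        }
    }
}
```

With this split, a Cassandra outage only takes the pod out of service (readiness fails) instead of triggering a container restart (liveness fails).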

The container restarts can be disruptive and cause some overhead that could otherwise be avoided.

During initialization the ReaperApplication.tryInitializeStorage method is called to initialize the connection to Cassandra. It loops until either it successfully connects or until a failure threshold is exceeded. There is a delay after each failed attempt. This behavior is exactly what we want in a Kubernetes environment; however, if the /healthcheck endpoint is used for the liveness probe, then that retry behavior is pointless. To be clear, my point is that there needs to be a different endpoint for the liveness probe and that the retry logic at startup is correct in the context of Kubernetes.
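
For readers who have not looked at the code, the retry behavior described above can be sketched roughly as follows. This is not the actual tryInitializeStorage implementation; the class name, the parameters, and the exception type are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a connect-with-retry loop; not Reaper's real code. */
public final class StorageRetry {
    private StorageRetry() {}

    /**
     * Calls connectAttempt until it succeeds or maxFailures attempts have failed,
     * sleeping for delay after each failed attempt.
     *
     * @return the number of attempts it took to connect
     * @throws IllegalStateException if every attempt failed
     */
    public static int connectWithRetry(BooleanSupplier connectAttempt, int maxFailures, Duration delay) {
        for (int attempt = 1; attempt <= maxFailures; attempt++) {
            boolean connected;
            try {
                connected = connectAttempt.getAsBoolean();
            } catch (RuntimeException e) {
                connected = false; // treat an exception as one more failed attempt
            }
            if (connected) {
                return attempt;
            }
            sleep(delay);
        }
        throw new IllegalStateException("storage not reachable after " + maxFailures + " attempts");
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

If the liveness probe fails while this loop is still running, Kubernetes restarts the container and the loop starts over from the first attempt, which is why the retry only pays off when liveness does not depend on Cassandra.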

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: REAP-94

adejanovski commented 2 years ago

The problem here is that the connection to the database happens before the HTTP endpoints start, so no probe URL is served before that 😕 That includes the admin port, which exposes the /ping endpoint.

jsanda commented 2 years ago

In ReaperApplication.run() we have:

...
tryInitializeStorage(config, environment);
...
final ReaperHealthCheck healthCheck = new ReaperHealthCheck(context);
environment.healthChecks().register("reaper", healthCheck);
...

Couldn't we just register another endpoint before the tryInitializeStorage call?
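
As a rough illustration of the ordering being discussed, the sketch below serves a liveness endpoint before storage is initialized and flips readiness afterwards. It deliberately uses the JDK's built-in com.sun.net.httpserver.HttpServer rather than Reaper's Dropwizard setup, so it says nothing about when Dropwizard itself starts serving, which is the concern raised in the earlier comment. The class name and the /liveness and /readiness paths are hypothetical.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: probe endpoints are served before storage initialization finishes. */
public final class EarlyLivenessSketch {
    private final AtomicBoolean storageReady = new AtomicBoolean(false);
    private final HttpServer server;

    public EarlyLivenessSketch() {
        try {
            // Port 0 lets the OS pick a free port; a real deployment would use a fixed one.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Liveness answers 200 as soon as the process can serve HTTP.
        server.createContext("/liveness", exchange -> respond(exchange, 200, "alive"));
        // Readiness answers 503 until storage initialization has completed.
        server.createContext("/readiness", exchange -> {
            boolean ready = storageReady.get();
            respond(exchange, ready ? 200 : 503, ready ? "ready" : "starting");
        });
        server.start();
    }

    /** Call once the (possibly long, retrying) storage initialization has succeeded. */
    public void markStorageReady() {
        storageReady.set(true);
    }

    public int port() {
        return server.getAddress().getPort();
    }

    public void stop() {
        server.stop(0);
    }

    private static void respond(HttpExchange exchange, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }

    /** Small client helper: returns the HTTP status code for a GET on the given path. */
    public static int status(int port, String path) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create("http://127.0.0.1:" + port + path).toURL().openConnection();
            conn.setRequestMethod("GET");
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this arrangement the storage initialization would run after the constructor returns and call markStorageReady() on success, so a probe on /liveness passes throughout startup while /readiness keeps traffic away until Cassandra is reachable.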