tikv / pd

Placement driver for TiKV
Apache License 2.0

Need readiness check endpoint for PD #5658

Open hanlins opened 1 year ago

hanlins commented 1 year ago

Feature Request

Describe your feature request related problem

When doing a rolling update of a PD cluster, we need to take down a PD instance, bring up a PD instance with the updated config, and repeat this for all PD instances one at a time. Typically we use tidb-operator to automate this, and a PD instance is considered "Running" as long as its pod has been up for some duration. However, the PD instance may still be catching up with the leader and receiving etcd raft logs, and so cannot serve requests yet. Say we have a PD deployment with 3 replicas and the first two pods have been updated but are not yet ready to serve new requests; at this point PD is unavailable, since the remaining PD instance cannot form a write quorum on its own (the other two are still syncing etcd data).

Describe the feature you'd like

It would be helpful if a PD instance could expose (maybe via a RESTful endpoint) whether its internal data is synced with the leader and it is ready to serve new requests.

Describe alternatives you've considered

An alternative approach in the tidb-operator scenario is to add an init container to PD's pod spec that sleeps for a certain amount of time. Until the sleep completes, the Pod stays in the Init state, so the rolling update process waits for it. Hopefully the etcd data for that PD instance is synced by the time the sleep finishes.
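As an illustration only, a rough sketch of such an init container written with client-go types, the way tidb-operator assembles pod specs (the name, image, and sleep duration are placeholders, not tidb-operator defaults):

```go
package podspec

import corev1 "k8s.io/api/core/v1"

// waitForEtcdSync is a hypothetical init container that just sleeps,
// hoping the embedded etcd member catches up with the leader meanwhile.
// It cannot actually verify readiness; it only delays the rollout.
var waitForEtcdSync = corev1.Container{
	Name:    "wait-for-etcd-sync",
	Image:   "busybox:1.36",
	Command: []string{"sh", "-c", "sleep 60"},
}
```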

Teachability, Documentation, Adoption, Migration Strategy

The endpoint would be something new, and since no existing system depends on it, we don't need to worry about migration.

hanlins commented 1 year ago

Proposing to add a RESTful endpoint, similar to the health endpoint, that exposes the instance's etcd readiness information: https://github.com/tikv/pd/blob/e53caec3f309efef60b8fcd6c4a8b55a2ee17bfd/tools/pd-ctl/pdctl/command/health_command.go#L24

The readiness check logic would be to check whether the current PD's corresponding etcd member is still a learner. If an etcd member is synced with the leader, it should have been promoted to a voting member, i.e. a follower^1. The etcd endpoint status response exposes the IS LEARNER field^2, so I suppose this is feasible.
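A minimal sketch of that idea (names and wiring are illustrative, not the actual PD code): query the local etcd member's status and report not-ready while it is still a learner.

```go
package api

import (
	"context"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readyHandler is a hypothetical readiness endpoint. It asks the local
// etcd member for its status and treats "still a learner" as not ready,
// since a synced member would have been promoted to a voter.
func readyHandler(client *clientv3.Client, localEndpoint string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()

		status, err := client.Status(ctx, localEndpoint)
		if err != nil || status.IsLearner {
			// Anything >= 400 makes a Kubernetes readiness probe fail.
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```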

cc @nolouch

rleungx commented 1 year ago

IMO, when we do a rolling update, there could also be a log lag between followers and the leader, which cannot be solved by only checking the learner state.

hanlins commented 1 year ago

> IMO, when we do a rolling update, there could also be a log lag between followers and the leader, which cannot be solved by only checking the learner state.

Sure, thanks for pointing that out! Any suggestions on what we should check?

rleungx commented 1 year ago

Can we just use the health API?

hanlins commented 1 year ago

> Can we just use the health API?

I was looking at CheckHealth; it seems to check all members rather than whether the current PD instance is ready: https://github.com/tikv/pd/blob/91f16642cfa3dc53da5aafa481afa63159224aaf/server/cluster/cluster.go#L2407-L2432

I think in a k8s rolling upgrade we are interested in whether the current PD instance has been keeping up with the leader. What about also checking the current PD's applied index against the leader's committed index?
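A rough sketch of that idea (again, the names are illustrative): compare the local member's RaftAppliedIndex against the leader's RaftIndex from the etcd endpoint status API, allowing a small lag.

```go
package api

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// isCaughtUp is a hypothetical check: the local member counts as caught
// up when its applied raft index is within maxLag entries of the
// leader's raft index, both taken from the etcd endpoint status API.
func isCaughtUp(ctx context.Context, client *clientv3.Client,
	localEndpoint, leaderEndpoint string, maxLag uint64) (bool, error) {
	local, err := client.Status(ctx, localEndpoint)
	if err != nil {
		return false, err
	}
	leader, err := client.Status(ctx, leaderEndpoint)
	if err != nil {
		return false, err
	}
	return leader.RaftIndex <= local.RaftAppliedIndex+maxLag, nil
}
```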

hanlins commented 1 year ago

Also, the health endpoint seems to return a 200 status all the time: https://github.com/tikv/pd/blob/01b8f34a4034d4c36fc0fb56e0fc7892bb6ecfa9/server/api/health.go#L73 In k8s we typically need the endpoint to return a status code of 400 or greater when a status check probe fails^1, quote:

> Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.
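For context, a readiness probe wired to such an endpoint might look like the following, again expressed with client-go types and placeholder values; the kubelet marks the probe failed on any status outside the 200-399 range.

```go
package podspec

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe is a hypothetical probe for the proposed endpoint;
// the kubelet treats any HTTP status outside 200-399 as a failure.
var readinessProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/pd/api/v1/ready", // placeholder path, not an existing PD route
			Port: intstr.FromInt(2379),
		},
	},
	PeriodSeconds:    10,
	FailureThreshold: 3,
}
```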