spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0
1.73k stars 461 forks

Improve Agent health check subsystem for steady-state health #4316

Open rturner3 opened 1 year ago

rturner3 commented 1 year ago

The agent health check endpoint does a basic test to see if the agent can reach the agent Workload Endpoint. This health check criterion is helpful when the agent is initializing, to make sure it is fully up and running. However, it is not a thorough check of steady-state health.

Other checks could give a more accurate indication of steady-state health than Workload Endpoint availability, such as checking the last time the agent successfully fetched entries from the server, or whether the agent failed to get SVIDs signed in its last attempt (or last few consecutive attempts). The last successful sync time is exposed over the debug API, but this API does not provide an aggregated view of agent health.

If the agent is unable to synchronize with the server, it stays up, but with an older cached state of its authorized entries and signed SVIDs. If the agent is stuck in this state for an extended period, the SVIDs it has cached may expire, and workloads that depend on those SVIDs may be affected. It would be helpful to be able to detect agent degradation before SVIDs expire, using the Agent's gRPC health endpoint. This would give operators a chance to quarantine the node with the unhealthy agent before workloads are impacted.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 365 days with no activity.