temporalio / helm-charts

Temporal Helm charts

Implementing Dependency-Responsive Liveness Probes in Frontend Pod #442

Closed ziemowit141 closed 3 months ago

ziemowit141 commented 11 months ago

Is your feature request related to a problem? Please describe.

We have identified a critical issue related to service interdependencies. When the matching pod fails, the frontend pod loses its connection to it, and the connection is not restored until the frontend pod is restarted manually. The reason is that the frontend pod initializes its connections to dependent services during startup, so any interruption in those services after launch results in the following error:

(Screenshot 2023-11-10 at 12 42 35)

Describe the solution you'd like

To mitigate this issue, we propose adding dependency-aware liveness probes to the frontend pod. These probes would continuously monitor the health of all dependent services. If a dependent service fails, the probe would fail as well and cause the frontend pod to be restarted, restoring the system's operational status without manual intervention.
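
To make the request concrete, here is a minimal sketch of the check such a probe might run. It is only an illustration under stated assumptions: the dependency names, hosts, and ports are hypothetical placeholders, the check is a plain TCP connect rather than anything the Temporal frontend or the chart provides, and the script is assumed to run as a Kubernetes exec probe, where exit code 0 means healthy and any other exit code means unhealthy.

```python
import socket
from typing import Dict, Tuple

# Hypothetical mapping of dependency name -> (host, port).
# These values are placeholders, not taken from the chart.
Dependencies = Dict[str, Tuple[str, int]]


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_dependencies(deps: Dependencies, timeout: float = 1.0) -> Dict[str, bool]:
    """Check every dependency and report reachability per name."""
    return {
        name: tcp_reachable(host, port, timeout)
        for name, (host, port) in deps.items()
    }


def probe_exit_code(results: Dict[str, bool]) -> int:
    """Map results to an exec-probe exit code: 0 = healthy, 1 = unhealthy."""
    return 0 if all(results.values()) else 1
```

Note that a check like this only shows that a dependency is reachable from the pod. It does not show whether the frontend's own connection to that dependency has recovered.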

Additional context

I'm not sure about the root cause of the matching pod failing. If I establish it, I will add those details to the issue.

robholland commented 3 months ago

If the frontend pods do not automatically reconnect to the matching service, please file an issue against the Temporal repo. Compensating for that should not be handled at the Kubernetes layer. It should not be the case, for example, that every time matching is rolled out the frontend pods report themselves unhealthy and get restarted. That would cause frontend connection churn for no reason.
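
For illustration, handling this in the application rather than at the Kubernetes layer could look roughly like the sketch below: the caller reconnects with exponential backoff when an operation hits a connection error. This is a generic pattern under stated assumptions, not how the Temporal frontend is implemented. `call_with_reconnect`, `connect`, and `operation` are hypothetical names introduced here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_reconnect(
    connect: Callable[[], object],
    operation: Callable[[object], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(conn), reconnecting with exponential backoff on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    conn: Optional[object] = None
    for attempt in range(max_attempts):
        try:
            if conn is None:
                # (Re)establish the connection lazily, not only at startup.
                conn = connect()
            return operation(conn)
        except ConnectionError:
            conn = None  # Drop the broken connection so the next try reconnects.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With this shape, a dependency restart costs the caller a few retries instead of a restart of the whole pod.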