spring-attic / spring-cloud-gcp

Integration for Google Cloud Platform APIs with Spring
Apache License 2.0

PubSubHealthIndicator reports health as DOWN in case of deadline exceeded error #2628

Open irenavy opened 3 years ago

irenavy commented 3 years ago

For one of our services deployed on GKE, we have been noticing frequent container restarts caused by liveness and readiness probe failures. With management.endpoint.health.show-details=always, we found that the status for pubsub was DOWN with a deadline exceeded error.

pubSub":{"status":"DOWN","details":{"error":"com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 24.999356599s. [buffered_nanos=25000339743, waiting_for_connection]"}

As per the implementation of PubSubHealthIndicator, it determines health by pulling from a random subscription. Is it correct to consider Pub/Sub as not ready based on a deadline exceeded error?

Also, what can be causing this error?

elefeint commented 3 years ago

A deadline exceeded error could certainly indicate a valid connection issue. 25 seconds is a very long time to wait for a Pub/Sub response! Do your other Pub/Sub operations or the GCP console show issues?

irenavy commented 3 years ago

Messages are getting consumed and processed successfully while the container is in the running state, and I don't see any warnings/errors in the application logs. We have observed that the /actuator/health endpoint takes more than a minute to respond. I tried enabling debug-level logs for org.springframework.cloud.gcp, but found nothing related to this issue. Is there anywhere else I can look to debug this?

elefeint commented 3 years ago

Did these actuator-driven restarts begin at a particular point in time, or have they always been happening?

irenavy commented 3 years ago

I have been monitoring the service pods for the last 2 days. From my observations, the restarts are frequent when the incoming message volume is medium to high.

elefeint commented 3 years ago

This helped! What I think happens is: the business logic done in message processing takes time, and takes up the available Pub/Sub subscriber threads. The healthcheck pull call then ends up never getting scheduled, leading to the waiting_for_connection deadline exceeded in the original message.

Try setting spring.cloud.gcp.pubsub.subscriber.executor-threads to a higher number (the default the client library ships with is 4).
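For example, in application.properties (the value 16 below is illustrative, not a recommendation; size the pool to your actual processing concurrency):

```properties
# Raise the subscriber executor pool above the client-library default of 4,
# so a health-check pull can still get a thread while business logic is busy.
# 16 is an illustrative value only.
spring.cloud.gcp.pubsub.subscriber.executor-threads=16
```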

patpe commented 3 years ago

I am experiencing this issue as well, under exactly the same conditions: GKE, and when the backlog of messages is large, the Kubernetes checks against /actuator/health time out and we run into recurring restarts (which cause other issues in message processing, since our JPA entity manager is forcefully closed, etc.).

We have not configured anything special in our application, i.e. vanilla spring-cloud-gcp-pubsub. I recommend changing the implementation of the health check to first check some internal state (number of messages successfully pulled in the last X seconds?) and, only if that signals that nothing has happened, try to pull a message. Having a default implementation that triggers this side effect seems... not good, for lack of better words.

meltsufin commented 3 years ago

@patpe Thanks for the feedback. We're continuing this discussion here. We've discussed ideas similar to what you're proposing but haven't gotten around to working on them. We definitely welcome contributions!

elefeint commented 3 years ago

As a quick workaround, I wonder if giving the healthcheck a dedicated PubSubTemplate/executor pool would fix all these issues. The root problem is that under high load it takes too long for a thread to get scheduled to pull.
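A rough illustration of that workaround is below. This is a hedged sketch, not code from the project: the bean wiring and the `dedicatedPubSubHealthIndicator` name are assumptions, and the constructor signatures should be verified against your spring-cloud-gcp release before use.

```java
import com.google.api.gax.core.CredentialsProvider;
import com.google.api.gax.core.InstantiatingExecutorProvider;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator;
import org.springframework.cloud.gcp.core.GcpProjectIdProvider;
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate;
import org.springframework.cloud.gcp.pubsub.core.publisher.PubSubPublisherTemplate;
import org.springframework.cloud.gcp.pubsub.core.subscriber.PubSubSubscriberTemplate;
import org.springframework.cloud.gcp.pubsub.support.DefaultPublisherFactory;
import org.springframework.cloud.gcp.pubsub.support.DefaultSubscriberFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class PubSubHealthConfig {

    // Sketch: give the health check its own PubSubTemplate backed by a dedicated
    // executor, so its pull never queues behind busy business-logic subscriber threads.
    @Bean
    public HealthIndicator dedicatedPubSubHealthIndicator(
            GcpProjectIdProvider projectIdProvider,
            CredentialsProvider credentialsProvider) {
        DefaultSubscriberFactory subscriberFactory =
                new DefaultSubscriberFactory(projectIdProvider);
        subscriberFactory.setCredentialsProvider(credentialsProvider);
        subscriberFactory.setExecutorProvider(
                InstantiatingExecutorProvider.newBuilder()
                        .setExecutorThreadCount(1) // reserved for health pulls only
                        .build());

        DefaultPublisherFactory publisherFactory =
                new DefaultPublisherFactory(projectIdProvider);
        publisherFactory.setCredentialsProvider(credentialsProvider);

        PubSubTemplate healthTemplate = new PubSubTemplate(
                new PubSubPublisherTemplate(publisherFactory),
                new PubSubSubscriberTemplate(subscriberFactory));
        return new PubSubHealthIndicator(healthTemplate);
    }
}
```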

Sslimon commented 2 years ago

Hi all

I had the same problem with the health endpoint and the deadline exception after some changes in the business logic. The business logic was much slower than before and there were a lot of messages in the subscriptions.

I read this article: https://medium.com/google-cloud/things-i-wish-i-knew-about-google-cloud-pub-sub-part-2-b037f1f08318 and set the max outstanding element count property to 10: spring.cloud.gcp.pubsub.[subscriber,publisher.batching].flow-control.max-outstanding-element-count=10. Description from the docs: Maximum number of outstanding elements to keep in memory before enforcing flow control.
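Spelled out, the bracket shorthand above expands to two separate properties (the value 10 is Simon's chosen cap, not a recommended default):

```properties
# Cap how many pulled messages the subscriber holds in memory at once,
# so executor threads are not all tied up processing a large outstanding batch.
spring.cloud.gcp.pubsub.subscriber.flow-control.max-outstanding-element-count=10
# The corresponding publisher-side batching flow-control cap.
spring.cloud.gcp.pubsub.publisher.batching.flow-control.max-outstanding-element-count=10
```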

It looks like it's working now. And I think it could make sense, since all the executor threads were busy handling the 1000 (I think that's the default value) messages in one period/stream and none were free to serve the actuator request.

All the best Simon

elefeint commented 2 years ago

@Sslimon That's a good workaround for slow message processing!

Sslimon commented 2 years ago

@elefeint The confusing part for me was that the service had no problems before the update. The changes had nothing to do with Pub/Sub, and after the deployment Kubernetes restarted the services and they didn't come up. It was hard for me to find the error and the reason why the health endpoint wasn't responding. But now I'm happy that it is running, and I can make the code faster :)

meltsufin commented 2 years ago

Thanks for the update!

elefeint commented 2 years ago

Revisiting this because we had another similarly behaving healthcheck contributed, and we'd like to make sure no problems are introduced.

@irenavy @Sslimon @patpe Does your application explicitly list the Pub/Sub health check in the management.endpoint.health.group.liveness.include property? The reason I ask is that, by default, no health indicators are included in the liveness/readiness probes, and in fact Spring Actuator cautions against liveness probes depending on external-system health checks for this exact reason: Kubernetes will trigger a pod restart upon a liveness failure.
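For reference, a minimal sketch of the relevant Actuator settings (the group contents shown are illustrative of the default behavior described above):

```properties
# Default-style liveness group: only the in-process liveness state, no external checks.
management.endpoint.health.group.liveness.include=livenessState
# Explicitly adding pubSub here would make Kubernetes restart the pod
# whenever the Pub/Sub check fails:
# management.endpoint.health.group.liveness.include=livenessState,pubSub
```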