When/where are you acking messages? Also, could you provide your `ReceiveSettings`?
The only `ReceiveSettings` we set are the ones above -- `MaxOutstandingMessages` (the exact number varies by worker), `Synchronous=true`, and in some cases `MaxExtension`.

Every worker we've seen this issue occur on had `Synchronous=true` and `MaxOutstandingMessages` set. Some of the workers had set `MaxExtension` and others hadn't, so that one must have fallen back to `DefaultReceiveSettings`.
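For concreteness, a minimal sketch of a subscriber configured with those settings (project/subscription names and exact values are illustrative):

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // hypothetical subscription

	// The settings described above; exact values vary by worker.
	sub.ReceiveSettings.Synchronous = true
	sub.ReceiveSettings.MaxOutstandingMessages = 30
	sub.ReceiveSettings.MaxExtension = 4 * time.Minute // only set on some workers

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// ... do work; every code path must end in Ack or Nack ...
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```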
We provide each worker with a custom `Message` interface that has a `Done(ack bool) error` method signature. Every worker has a `defer m.Done(false)`, which acts as a no-op if `m.Done` has already been called. Otherwise we call `m.Done(true)` if the work completed successfully. So the acks/nacks are called from the worker doing the work. In the code above you can see where we set the `ack` and `nack` functions needed on `message` from `gpubsub`, which in turn are called by `m.Done(ack bool)` depending on the value of the `ack` bool.

We alias the import: `gpubsub "cloud.google.com/go/pubsub"`.
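A rough sketch of that wrapper pattern (simplified; our real implementation has more to it, and the names here are illustrative):

```go
package worker

import "sync"

// message wraps a gpubsub.Message behind a Done(ack bool) error interface.
// Done is idempotent: the first call wins, later calls are no-ops.
type message struct {
	ack  func() // wired to the gpubsub message's Ack
	nack func() // wired to the gpubsub message's Nack
	once sync.Once
}

func (m *message) Done(ack bool) error {
	m.once.Do(func() {
		if ack {
			m.ack()
		} else {
			m.nack()
		}
	})
	return nil
}

// Usage inside the Receive callback:
//
//   m := &message{ack: pm.Ack, nack: pm.Nack}
//   defer m.Done(false) // failsafe nack if the worker never calls Done
//   if err := work(ctx, m); err == nil {
//       m.Done(true)
//   }
```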
We're seeing similar occasional subscription issues where messages start backing up in Pub/Sub and our callback function stops being called, without `Subscription.Receive` returning an error.

We have `Synchronous = true` and `MaxOutstandingMessages < 100`.

Our Dockerfile is `FROM golang:1.11`, running in a GKE Kubernetes pod.
google-cloud-go version = v0.34.0
grpc version = v1.17.0

I will try updating the google-cloud-go and grpc versions, but it looks like that won't be the fix.
@jadekler Wanted to follow up on reclassifying this from `type: question` to `investigating` or `bug`. It seems others are running into this issue as well.

This has significant impact because it's an intermittent, quiet failure: either `Pull` fails or the callback stops being called, without any error surfacing. Since streaming pull was released, there are many issues here with comments encouraging users to use synchronous pull for their use case, and many may not realize this is happening in production since there are no errors/logs -- messages may just be quietly expiring after the 7-day period of not being acked.

And while there is some Stackdriver monitoring, as far as I know there is no way to periodically check, via the API, the number of unacknowledged messages on a particular subscription in order to manually intervene, or even to programmatically restart a worker that may not be completing pending work.
Sorry, this fell off my radar. Back to investigating it.
Thanks again for your experience reports. I've done more spelunking, and unfortunately failed to find a reason this might happen, nor am I able to reproduce this. Next steps:
Could y'all double and triple check that Acks/Nacks are happening on every message? Each message that never gets acked/nacked counts against flow control forever; eventually enough lost messages pile up that the flow controller has no capacity remaining (which means messages never get pulled again).
Could you add logging around the following, and then provide these logs from around the time that you experienced the disruption:

- `fmt.Println(po.maxPrefetch, fc.count())` (is our flow control exhausted?)
- `fmt.Println(len(msgs))` (are we sending Receive requests, but not receiving anything?)
- A `var cnt int64` at the global level in subscription.go, an `atomic.AddInt64(&cnt, int64(len(msgs)))` around L548, and an `atomic.AddInt64(&cnt, -1)` plus a `fmt.Println(atomic.LoadInt64(&cnt))` in the doneFunc. If you see this value steadily increasing over time, it means some messages are not being acked/nacked (a caller-side version of this check is sketched below).

Could you also provide a minimal reproduction? The minimal repro is a very-nice-to-have, but we can make do with the first two. :)
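If patching the library isn't practical, an approximate version of that third check can live in the caller instead. A sketch (it assumes every code path acks/nacks before the callback returns, and the names are illustrative):

```go
package main

import (
	"context"
	"log"
	"sync/atomic"

	"cloud.google.com/go/pubsub"
)

// outstanding counts messages delivered to the callback but not yet
// finished -- the same quantity the cnt suggestion above measures inside
// subscription.go, observed from the caller's side. If it climbs toward
// MaxOutstandingMessages and stays there, flow control is being exhausted.
var outstanding int64

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // hypothetical
	sub.ReceiveSettings.Synchronous = true
	sub.ReceiveSettings.MaxOutstandingMessages = 30

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		log.Println("outstanding:", atomic.AddInt64(&outstanding, 1))
		// Decrementing on return assumes Ack/Nack happens before the
		// callback returns; a leaked message would never decrement here.
		defer atomic.AddInt64(&outstanding, -1)

		// ... do work ...
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```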
> We provide each worker with a custom `Message` interface that has a `Done(ack bool) error` method signature. Every worker has a `defer m.Done(false)`, which acts as a no-op if `m.Done` has already been called.
Quick note @cristiangraz: each Ack/Nack after the first is already a no-op on our side, so the `ack` bool may be superfluous.
Also, I don't suppose you have an OpenCensus exporter configured? If you have a Stackdriver exporter set up, for example, we can use some OC metrics to help debug.
@jadekler Thanks for the detailed reply and investigation.
We ack and nack every message that is processed -- I've double and triple checked. The only scenario where that wouldn't happen is if the worker's code didn't actually get called.
A year or so ago this library was sending significant amounts of duplicate (and concurrent) messages, so we wrapped all of our workers in SQL locks keyed by subscription/message ID to prevent concurrent or duplicate work (sketched below). One possibility for failing to ack/nack is if we filled up our max prefetch with messages waiting on locks before we were able to dispatch the worker's function, where the ack or nack happens. In that scenario, we'd essentially fill up our queue with workers waiting to start.
I'm not exactly sure how that could happen given some other protections we have in place, but to be on the safe side I made a few changes after opening this issue.
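Roughly, the locking pattern looks like this (a simplified sketch using Postgres advisory locks; illustrative, not our exact schema or database):

```go
package worker

import (
	"context"
	"database/sql"
)

// tryLock takes a session-scoped Postgres advisory lock keyed on
// subscription + message ID, so only one worker processes a given
// message at a time. The same *sql.Conn must be used to unlock.
func tryLock(ctx context.Context, conn *sql.Conn, subID, msgID string) (bool, error) {
	var got bool
	err := conn.QueryRowContext(ctx,
		`SELECT pg_try_advisory_lock(hashtext($1))`,
		subID+"/"+msgID,
	).Scan(&got)
	return got, err
}

func unlock(ctx context.Context, conn *sql.Conn, subID, msgID string) error {
	_, err := conn.ExecContext(ctx,
		`SELECT pg_advisory_unlock(hashtext($1))`,
		subID+"/"+msgID,
	)
	return err
}
```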
In terms of this library, would it make sense to have a safeguard that returns an error if flow control is exhausted beyond a configurable period of time? For example, if flow control is exhausted with no new messages coming in for `var maxFlowControlExhausted time.Duration`, a `pubsub.ErrFlowControlExhausted` error is returned from `receive()`?

If I know that my workers take no more than 5s to run and I set the config at 10s, an error can be returned any time flow control is exhausted for more than 10s with no change.

I'm thinking of this more as a safeguard that could either prevent this issue now, or, if that's not what's happening, serve as protection against it ever occurring in the future.
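In the meantime, an approximation of that safeguard can be built on the caller's side with a watchdog that cancels `Receive` when no message has arrived for too long. A sketch (names and thresholds are illustrative; note it can't distinguish exhausted flow control from a genuinely empty queue, so it effectively forces a periodic restart):

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync/atomic"
	"time"

	"cloud.google.com/go/pubsub"
)

var errStalled = errors.New("no messages received within the stall threshold")

// receiveWithWatchdog cancels Receive if no message arrives for maxIdle.
func receiveWithWatchdog(ctx context.Context, sub *pubsub.Subscription, maxIdle time.Duration) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var lastMsg int64
	atomic.StoreInt64(&lastMsg, time.Now().UnixNano())

	stalled := make(chan struct{})
	go func() {
		ticker := time.NewTicker(maxIdle / 2)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				last := time.Unix(0, atomic.LoadInt64(&lastMsg))
				if time.Since(last) > maxIdle {
					close(stalled)
					cancel() // unblocks Receive
					return
				}
			}
		}
	}()

	err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		atomic.StoreInt64(&lastMsg, time.Now().UnixNano())
		// ... dispatch to the worker; every path must Ack or Nack ...
		m.Ack()
	})
	select {
	case <-stalled:
		return errStalled
	default:
		return err
	}
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // hypothetical
	if err := receiveWithWatchdog(ctx, sub, 10*time.Second); err != nil {
		log.Fatal(err) // exit; let the pod restart and re-establish the pull
	}
}
```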
Thanks, I look forward to hearing back from your investigation.
> In terms of this library, would it make sense to have a safeguard that returns an error if flow control is exhausted beyond a configurable period of time? For example, if flow control is exhausted with no new messages coming in for `var maxFlowControlExhausted time.Duration`, a `pubsub.ErrFlowControlExhausted` error is returned from `receive()`?
cc @kolea2 @kir-titievsky for feature request.
@jadekler After implementing these changes, I have not seen this issue appear again. Since we have not yet heard from @limbalrings, I'm going to close this issue.
Thanks for all your help and pointing me towards the "flow control exhausted" possibility. The timeouts and deferred nacks (as a failsafe) now act as an important safeguard for us in synchronous mode. Hopefully this is helpful to others who may run into this issue in the future.
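Concretely, the shape of those safeguards is roughly this (simplified; the timeout value is illustrative):

```go
package worker

import (
	"context"
	"time"
)

// Doner is the idempotent ack/nack interface from the wrapper sketched
// earlier in this thread: Done(true) acks, Done(false) nacks, first call wins.
type Doner interface {
	Done(ack bool) error
}

// handle runs one worker invocation with a hard timeout and a deferred
// nack failsafe, so a stuck or panicking worker cannot leak a
// flow-control slot.
func handle(ctx context.Context, m Doner, work func(context.Context) error) {
	defer m.Done(false) // failsafe: no-op if Done(true) ran below

	// Our workers take no more than 5s, so bound them explicitly.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	if err := work(ctx); err == nil {
		m.Done(true)
	}
}
```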
Thanks for reporting back, @cristiangraz :+1:
Hi, we ran into the same issue. We didn't nack messages on purpose, as a way to wait before retrying on error. It's not really clear (to me) from the documentation that you need to call (n)ack for every message; I read the documentation as saying that calling nack is optional and only leads to quicker retries. Maybe the documentation can be improved?

Is there a way to add a pause before a message is retried?
@alicebob There is not; all messages must be acked/nacked. Please let us know (by filing a new issue) if there is a place we can make that more clear.
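That said, if you need a pause before redelivery under that constraint, one workaround is to schedule the Nack rather than skip it. A sketch (the delay is illustrative): the message still gets nacked, just later; the trade-off is that it keeps occupying one of your `MaxOutstandingMessages` slots, with its deadline extended (up to `MaxExtension`), until the nack fires.

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // hypothetical

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		if err := process(m); err != nil {
			// Delay redelivery instead of nacking immediately. The message
			// holds a flow-control slot until the nack actually fires.
			time.AfterFunc(30*time.Second, m.Nack)
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}

func process(m *pubsub.Message) error { return nil } // stand-in worker
```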
Client

pubsub

Describe Your Environment

`alpine:3.8` running a binary built with `golang:1.11.5-alpine3.8`, in a GKE Kubernetes pod.
google-cloud-go version = v0.35.1
grpc version = v1.18.0

Expected Behavior

In synchronous mode, messages continue to be pulled while there are still messages on the queue. If messages are not able to be pulled, `Receive()` should exit with an error.

Actual Behavior

`Receive()` hangs on occasion but does not always exit or return an error. This causes our worker to appear to be running while no messages are processed. Once we've noticed, restarting the worker (which initiates a fresh `Pull()`) will immediately process all of the messages that were sitting on the queue.

Also, I want to be clear that sometimes `Receive()` does return an error and we see it logged. But not always, leading to a worker that is still technically running but not actually processing any messages.

Code sample
On `ReceiveSettings`, we set `MaxOutstandingMessages=30` and `Synchronous=true`. In other subscriptions this has happened with `MaxOutstandingMessages=10`, `MaxExtension=4 minutes`, and `Synchronous=true`.

The caller runs in an infinite `for` loop with a `select` on either `ctx.Done()` or a message being received on `ch`. If the channel is closed, we exit. Anytime we shut down (whether due to receiving a `SIGTERM`/`SIGKILL`, context cancelation, or messages failing to pull as logged above) we log the error, exit, and restart the pod. A sketch of this shape follows below.

We do observe the `error receiving messages` line occasionally logged, and the shutdown behavior of our code/pod works as expected. But there are times where no error occurs yet no new messages are being pulled (while there are messages on the queue waiting to be processed).

We also tend to process fewer messages than others do on these queues -- in some cases a few hundred messages per day, and on some days zero messages are processed. I've been able to go back and see the hanging originate in the middle of messages being processed (vs. hanging after a long period of "quiet" time).
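A minimal sketch of that consumer shape (simplified; names, channel wiring, and settings are illustrative rather than our exact code):

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	gpubsub "cloud.google.com/go/pubsub"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Cancel the context on SIGTERM/SIGINT so Receive drains and returns.
	// (SIGKILL cannot be trapped; the pod just dies.)
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		cancel()
	}()

	client, err := gpubsub.NewClient(ctx, "my-project") // hypothetical
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-subscription") // hypothetical
	sub.ReceiveSettings.Synchronous = true
	sub.ReceiveSettings.MaxOutstandingMessages = 30

	ch := make(chan *gpubsub.Message)

	// Receive feeds messages onto ch; when it returns, close ch so the
	// consumer loop below exits.
	go func() {
		err := sub.Receive(ctx, func(ctx context.Context, m *gpubsub.Message) {
			select {
			case ch <- m:
			case <-ctx.Done():
				m.Nack() // don't leak the flow-control slot on shutdown
			}
		})
		if err != nil {
			log.Println("error receiving messages:", err)
		}
		close(ch)
	}()

	// The infinite for/select consumer loop described above.
	for {
		select {
		case <-ctx.Done():
			log.Println("shutting down:", ctx.Err())
			return // exit; the pod restarts us
		case m, ok := <-ch:
			if !ok {
				log.Println("message channel closed; exiting")
				return
			}
			// ... dispatch to the worker, which must Ack or Nack ...
			m.Ack()
		}
	}
}
```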
Options explored

In synchronous mode, only `MaxOutstandingMessages` are pulled at a time. We use SQL locks to protect against race conditions/concurrent message processing, as well as to avoid duplicate work on a message. We use exponential backoff with jitter (sketched below) and only hold the messages for 30s to avoid head-of-line blocking.

I suspected that if the worker was holding `MaxOutstandingMessages` then `Pull()` would not pull any others until one or more slots opened up. However, looking at the number of outstanding messages and open SQL locks, there were only 8 of the 30 possible messages. We also cycle those out every 30s, so messages should still get through even if there is a bit of a lag. What we're seeing is no messages getting through.

In one case (on another subscriber) there was a 16-day stall during which hundreds of messages were continuously added to the queue but not pulled, until we caught it and restarted.
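For reference, the exponential-backoff-with-jitter shape mentioned above looks roughly like this (parameters are illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoff returns the wait before retry attempt n (0-indexed): an
// exponentially growing base, capped, with random jitter so contending
// workers don't retry in lockstep.
func backoff(n int) time.Duration {
	const (
		base = 100 * time.Millisecond
		max  = 10 * time.Second
	)
	d := base << uint(n) // 100ms, 200ms, 400ms, ...
	if d > max || d <= 0 {
		d = max // cap, and guard against shift overflow
	}
	// Full jitter: pick uniformly in [0, d).
	return time.Duration(rand.Int63n(int64(d)))
}

func main() {
	for n := 0; n < 5; n++ {
		fmt.Println(n, backoff(n))
	}
}
```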