googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0

pubsub: process hanging forever when large number of subscription combined with NumGoroutines #2593

Closed Wangxishu closed 4 years ago

Wangxishu commented 4 years ago

Client

PubSub v1.3.1 cloud.google.com/go v0.57.0 google.golang.org/grpc -> v1.29.1

Environment

GKE n1-standard-4

Go Environment

Ubuntu 16.04.4 LTS + go1.14.2 linux/amd64

Code

e.g.

sub := client.Subscription(subName)
ok, err := sub.Exists(ctx) // verify the subscription exists
if err != nil {
  client.Close()
  return nil, err
}
if !ok {
  client.Close()
  return nil, fmt.Errorf("pubsub subscription does not exist: %s", path)
}
// maxOutstandingMessages could be 15, 100, or 1000 for different subscriptions.
sub.ReceiveSettings.MaxOutstandingMessages = conf.maxOutstandingMessages
// The issue appears when numPullerGoroutines >= 7.
sub.ReceiveSettings.NumGoroutines = conf.numPullerGoroutines
sub.ReceiveSettings.Synchronous = false
// Receive blocks until ctx is done or an unrecoverable error occurs.
sub.Receive(ctx, func(_ context.Context, m *gpubsub.Message) {
  if err := handler(hctx, m); err == nil {
    m.Ack()
  } else {
    m.Nack()
  }
})

Expected behavior

Pubsub subscriptions receive messages correctly

Actual behavior

When ReceiveSettings.NumGoroutines >= 7 (this was our previous setting, which worked fine), the process gets stuck right after starting.

Additional context

We recently upgraded google-cloud-go to 0.57.0, which comes with pubsub v1.3.1, and started seeing this issue. Our process receives from 17 Pub/Sub subscriptions (through one Pub/Sub client).

By digging further (e.g. with NumGoroutines=8), we found that the process immediately hits grpc.transport.http2Client.streamQuota (the maximum number of concurrent gRPC streams), which defaults to 100; the GCP Pub/Sub servers also appear to set it to 100 (captured in the handleSettings() handler). Once this limit is hit, new gRPC streams can't be created and must wait for old ones to close. The issue is that the process hangs here forever: all streams are waiting and never make progress.

It seems the number of gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions.

The old version of pubsub (which comes with google-cloud-go 0.36.0) does not have this behavior: when we set NumGoroutines=8, pubsub/grpc only creates about 40 streams.
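
To make the arithmetic concrete, here is a rough sketch of our setup (project ID, subscription names, and handler are placeholders, not our real code); with Synchronous=false, each Receive call appears to open NumGoroutines streaming pulls, so 17 subscriptions * 8 goroutines is about 136 streams, over the default quota of 100:

package main

import (
  "context"
  "log"

  "cloud.google.com/go/pubsub"
)

func main() {
  ctx := context.Background()
  client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  // Placeholder names standing in for our 17 real subscriptions.
  subNames := []string{"sub-01", "sub-02" /* ... through sub-17 ... */}

  for _, name := range subNames {
    sub := client.Subscription(name)
    sub.ReceiveSettings.NumGoroutines = 8 // 8 StreamingPull streams per subscription
    sub.ReceiveSettings.Synchronous = false
    go func(s *pubsub.Subscription) {
      // With 17 subscriptions this adds up to roughly 17 * 8 = 136 streams.
      if err := s.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
        m.Ack() // real handler omitted
      }); err != nil {
        log.Printf("receive %s: %v", s.ID(), err)
      }
    }(sub)
  }
  select {} // keep the process alive (sketch only)
}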

My questions:

  1. Is gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions the expected behavior?

  2. Is our way of setting NumGoroutines correct? With 17 subscriptions, 17 * 8 = 136 streams exceeds the streamQuota of 100 (at least by default), which seems to be what causes the degraded performance or the hang. What's the recommended way to set this value when we have a relatively large number of subscriptions?

Thanks for reading.

hongalex commented 4 years ago

This seems to be an issue with cloud.google.com/go/pubsub v1.3.0, where we updated the way our gRPC connections work. I can reproduce this on all versions v1.3.0 or higher (or on any modules that require google.golang.org/api v0.16.0 or higher).

In the meantime, to avoid this, please revert to cloud.google.com/go/pubsub v1.2.0 as a temporary workaround. I'll be looking into which dependency changes might have caused this and expect to have an update soon.
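
For anyone applying the workaround in a Go modules project, pinning the module back should look roughly like this (assuming nothing else in your build forces a newer pubsub version):

go get cloud.google.com/go/pubsub@v1.2.0
go mod tidy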

Wangxishu commented 4 years ago

Thank you, Alex, for getting back to us quickly.

iagomelanias commented 4 years ago

Looks like the PR with the fix was merged; I'm just waiting for a new release. 🙂

hongalex commented 4 years ago

Thanks for your patience! This is now released in pubsub/v1.6.0.

maroux commented 3 years ago

I think this is still a problem in cloud.google.com/go/pubsub v1.11.0 / google.golang.org/grpc v1.38.0.

I have an app that pulls from 27 subscriptions. Here's our unacked message count graph with the default setting of NumGoroutines (=10):

[Screenshot: unacked message count graph]

Here's the ack requests graph, which follows the same pattern:

[Screenshot: ack requests graph]

As seen, there's a cyclical pattern: all messages get processed, then processing stalls for roughly a couple of hours, then all messages get processed again. The time to process an individual message doesn't change, and neither does the publish rate. Processing doesn't stop completely, it just slows to a crawl (going from 30K rpm to ~500 rpm).

Setting NumGoroutines=1 on every subscription makes everything process just fine, with the ack requests graph settling at a roughly constant value (after 11:30):

[Screenshot: ack requests graph after setting NumGoroutines=1]

I'm not entirely sure how to debug this further. I did notice the same streamQuota (100).
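
For reference, a minimal sketch of what the workaround looks like in code, assuming a pre-built client and a placeholder list of subscription names (receiveAll and handleMessage are illustrative names, not from our codebase):

package subscriber

import (
  "context"
  "log"

  "cloud.google.com/go/pubsub"
)

// receiveAll starts one Receive loop per subscription, each limited to a
// single StreamingPull stream (NumGoroutines = 1).
func receiveAll(ctx context.Context, client *pubsub.Client, subNames []string,
  handleMessage func(context.Context, *pubsub.Message)) {
  for _, name := range subNames {
    sub := client.Subscription(name)
    sub.ReceiveSettings.NumGoroutines = 1 // one stream per subscription
    go func(s *pubsub.Subscription) {
      if err := s.Receive(ctx, handleMessage); err != nil {
        log.Printf("receive %s: %v", s.ID(), err)
      }
    }(sub)
  }
}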

willie commented 2 years ago

For what it's worth, I see a sawtooth pattern like the one @maroux notes with several of my Go Pub/Sub client programs. I'm not up to date on what the "highest performance" set of options for the client is, but I definitely see the jagged unacked-message graph when I look at subscriptions.

userwerk commented 5 months ago

We observed a very similar problem. In our setup we have 22 subscriptions (exactly-once delivery enabled) on a single pod with 10 goroutines (the default setting), and we saw the zig-zag pattern in the graphs as well. The subscriber received messages but didn't pass them to the actual subscription handler in the application; it sometimes took over 2 hours until a message was actually processed. Then suddenly all stuck messages were processed at once, only to get stuck again shortly after. We also received a lot of "automatic lease modack retry failed for following IDs" errors with status 3 - PERMANENT_FAILURE_INVALID_ACK_ID, as well as "rpc error: code = Unauthenticated desc = Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential." errors.

All of these problems vanished after we decreased the number of goroutines from 10 to 1. We played around with the settings to determine the breaking point of the system: it turns out our setup works with 4 goroutines but breaks with 5. Now messages are processed pretty much instantly.
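
As an aside, if the streams-per-subscription arithmetic described earlier in this thread still applies (it may not, given the later fixes), the breaking point lines up with the default stream quota; a tiny illustration:

package main

import "fmt"

func main() {
  const subs, quota = 22, 100 // 22 subscriptions, default gRPC stream quota of 100
  for _, g := range []int{4, 5} {
    fmt.Printf("NumGoroutines=%d -> %d streams (quota %d)\n", g, subs*g, quota)
  }
  // 22*4 = 88 stays under the quota; 22*5 = 110 goes over it.
}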

We are of course hoping for a fix or at least an explanation of the bug, but we can live with the current setup for now.

akallu commented 3 months ago

@userwerk I'm running into the same issue, and it seemingly started when we upgraded our client library from v1.36.1 to v1.36.2. The only change in 1.36.2 (changelog here) was the PR here.

Not super certain how this could be causing the issue, to be honest; it might be a red herring. I'm going to try your approach of decreasing our goroutines and see if that works.

EDIT: Decreasing the goroutine count to 1 worked! Thank you so much for your comment; it saved me a lot of headache. I'll try pushing it up to 4 like you did, but I'm not sure this will be an acceptable solution for my use case, as it's a pretty heavy workload. I'll raise this issue further with Google. Thanks again!

hongalex commented 3 months ago

Hey @akallu and @userwerk, sorry for not commenting on this earlier. Sometimes closed issues fall off my radar, since they depend on transient notifications that I don't always see right away.

This issue is something we're aware of, and the fact that decreasing NumGoroutines to 1 helps basically confirms it's a stream-related issue. The root cause is described here; specifically, it is related to messages expiring and clogging up the client library when they aren't processed in time.

I suspect the upgrade might be a red herring; I can't see how that change is related to more messages expiring. I would look into it further, but decreasing streams seems to have already unblocked this specific case for you.

We're looking for a fix, but here are some other insights I want to share:

  1. The default NumGoroutines=10 is actually too high for a lot of use cases. It opens 10 streams, and each stream can deliver 10 MB/s of messages. If you aren't hitting this throughput, I recommend decreasing to just 1.
  2. You could consider more horizontal sharding (increasing the number of subscriber clients, each with NumGoroutines=1); see the sketch after this list.
  3. NumGoroutines isn't actually indicative of concurrency. The number of callbacks / messages being processed at once is tied to MaxOutstandingMessages. We're planning to rework this soon, since this value is also currently tied to stream flow control, which is what is ultimately causing this poor behavior in the first place.
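
As a minimal sketch of the horizontal sharding in item 2 (project ID, subscription ID, and shard count below are placeholders, not recommended values):

package main

import (
  "context"
  "log"

  "cloud.google.com/go/pubsub"
)

func main() {
  ctx := context.Background()
  const shards = 4 // illustrative: several subscriber clients for one subscription

  for i := 0; i < shards; i++ {
    client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
    if err != nil {
      log.Fatal(err)
    }

    sub := client.Subscription("my-subscription") // placeholder subscription ID
    sub.ReceiveSettings.NumGoroutines = 1            // one stream per client
    sub.ReceiveSettings.MaxOutstandingMessages = 100 // concurrency is governed here

    go func(s *pubsub.Subscription) {
      if err := s.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
        m.Ack() // real handler omitted
      }); err != nil {
        log.Printf("receive: %v", err)
      }
    }(sub)
  }
  select {} // keep the process running (sketch only)
}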

Hope this helps and I'll watch this issue a bit more closely for now. If you're experiencing behavior that isn't related to streams, please open another issue. I'll also try to get plugged into the internal issue if that helps.

akallu commented 3 months ago

@hongalex Thanks for taking the time to respond, much appreciated!

Your comment makes sense for the most part; my only remaining question is about the level of concurrency. When you say it's tied to MaxOutstandingMessages, does that mean it's exactly that value or some multiplier on top of it?

EDIT: Never mind, I read the issue you linked and understand now. Thanks!