googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference

pubsub: Large number of duplicate messages suddenly #778

Closed cristiangraz closed 5 years ago

cristiangraz commented 6 years ago

I've been noticing an influx of duplicate messages. Previously I don't think I had ever come across one; suddenly I started seeing large volumes of dupes. For example, out of 50 messages that were all ACKd just recently, I saw 48 duplicates. It sounds similar to this issue: https://github.com/GoogleCloudPlatform/google-cloud-java/issues/2465

When I look in Google Cloud Console at the API requests, I'm seeing large numbers of 499 error codes. In the last 4 days I have 1,312 requests with response code 200, but 7,580 with error code 499.

MaxOutstandingMessages = 10, MaxExtension = 10 minutes

Versions:

pongad commented 6 years ago

@cristiangraz We have a working reproduction in Java. I'll see if I can also repro it here. Could you let us know how long the deadline on your subscription is?

cristiangraz commented 6 years ago

@pongad Is the deadline different from MaxExtension in the ReceiveSettings struct? Just want to confirm I give you the right thing, MaxExtension is 10 minutes and MaxOutstandingMessages in this instance was set to 10.
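
For reference, those knobs live on the subscription's ReceiveSettings and are applied before calling Receive. A minimal sketch with the values above (project and subscription IDs are placeholders, not from this thread):

package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-sub") // placeholder subscription ID
	sub.ReceiveSettings.MaxExtension = 10 * time.Minute
	sub.ReceiveSettings.MaxOutstandingMessages = 10

	// Receive blocks, running at most MaxOutstandingMessages handlers concurrently.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// ... process the message ...
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}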

pongad commented 6 years ago

@cristiangraz Sorry for the confusion, I was referring to the deadline configured on the pubsub subscription. You can get it from either the cloud console, or run:

// Fetch the subscription's configuration from the service; AckDeadline is the deadline in question.
cfg, err := subscription.Config(ctx)
_ = err // handle err
fmt.Println(cfg.AckDeadline)

I sent a PR which should fix the problem to the Java repo. I plan to fix all the design concerns there first, then I'll replicate it here.

cristiangraz commented 6 years ago

@pongad the deadline on the subscription is set to 10 seconds (from cloud console). Also, in this instance all of the workers are completing in ~2-3 seconds once they start receiving the message.

jfbus commented 6 years ago

Same here. We tried using MaxOutstandingMessages = 10 and got lots of duplicates. Switching back to the default was much better. AckDeadline was also set to 10 seconds.

pongad commented 6 years ago

This symptom suggests that this is the same problem we're seeing in Java.

Our diagnosis is this: the perceived communication latency between the pubsub server and the client can be high, so the client thinks messages expire further into the future than they actually do. By the time the client sends a "please give me more time" request to the server, it's too late: the server has already considered the messages expired and resent them.

cristiangraz commented 6 years ago

@pongad With a pubsub deadline (in cloud console) of 10s: pubsub starts counting the 10s before the client starts counting 10s (due to latency) -- so pubsub has already resent the message before the client could request an extension? Am I understanding that correctly?

If a worker on our end is processing a message in < 2s, does that mean there could be up to 8s of latency between the pubsub server and our client? Is there something on the Pubsub server that has caused latency to increase recently? We're running on Google Container Engine, so trying to grasp what the latency issues might be inside the datacenter from GKE to Pubsub.

If we changed the subscription deadline to 30s for example, would that solve this issue?

We have MaxOutstandingMessages (of 10 in this instance) equal to the number of goroutines processing messages (in order to avoid pulling messages before they are ready to be processed).

Thanks for all your help so far.

pongad commented 6 years ago

@cristiangraz There are a few details, so I need to answer out of order.

pubsub has already resent the message before the client could request an extension? Am I understanding that correctly?

Yes, we believe that's the root cause.

If a worker on our end is processing a message in < 2s, does that mean there could be up to 8s of latency between the pubsub server and our client?

There are a few variables at play. According to this comment, each response contains "a single batch as published". So if you set MaxOutstandingMessages = 10, the server might send us 11 messages in the same response. The client will process up to 10 messages concurrently, but the 11th message needs to wait until one of the first 10 finishes. So the 11th message might appear to take 4 seconds to process, etc.

Is there something on the Pubsub server that has caused latency to increase recently? We're running on Google Container Engine, so trying to grasp what the latency issues might be inside the datacenter from GKE to Pubsub.

I'm actually not 100% sure. After I reproduced the problem, the logs showed that there has to be some latency going on, but I cannot tell which component (gRPC? pubsub server? firewall? etc) is introducing the latency.

If we changed the subscription deadline to 30s for example, would that solve this issue?

That would certainly help. I should note that if you set it too high, it will take longer for the message to be redelivered if your machine crashes. If you decide to increase the deadline, could you let us know if it's helping?

I created a PR to Java client (after merging, I can replicate in other langs) that will make the client send "I need more time" message to the server earlier to account for the latency.

zhenjl commented 6 years ago

I am running into the same issue and it's incredibly frustrating. In my debugging, I set ReceiveSettings.NumGoroutines = 1, MaxOutstandingMessages = 1 and ReceiveSettings.MaxExtension = 10 * time.Minute.

With the default subscription.AckDeadline = 10 seconds:

  1. If I ack the msg as soon as I receive it, I get 100% deletion rate.
  2. If I wait 100ms after (just sleep), I deleted 196 of 200 msgs.
  3. If I wait 200ms after, I deleted 102 of 200 msgs.
  4. If I wait 300ms after, I deleted only 68 of 200 msgs.

I also created a subscription w/ AckDeadline: 5 * time.Minute. The deletion/ack rate is slightly better. The last run I did w/ 300ms delay ack'ed 123 of 200 msgs.
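
For concreteness, the delayed-ack handler in that test is roughly the following. This is only a sketch that reuses the kind of client/subscription setup shown earlier in the thread; the sleep is the variable being swept.

sub.ReceiveSettings.NumGoroutines = 1
sub.ReceiveSettings.MaxOutstandingMessages = 1
sub.ReceiveSettings.MaxExtension = 10 * time.Minute

if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
	time.Sleep(300 * time.Millisecond) // 0ms / 100ms / 200ms / 300ms across the runs above
	m.Ack()
}); err != nil {
	// handle err
}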

Did Google Pubsub change something on their end? Seems like it's all happening in the last couple of days.

cristiangraz commented 6 years ago

@pongad I tried to change the subscription deadline in the cloud console, but I didn't see any option for it. Is there any way to get more info internally from the Pubsub team about possible latency issues? It's definitely worrisome to see latency > 10-20ms inside of GKE, let alone the possibility of latency of several seconds (if it's not related to streaming pull).

w.r.t. MaxOutstandingMessages and streaming pulls -- so if I set a value of 10, I should receive 10 messages? The only exception being that messages published in a batch are also fetched in a batch, causing the streaming pull to possibly retrieve more than 10 messages and hold them until they can be processed? If that's correct: in our case we were seeing duplicate messages mostly in our domain system, downstream of an order. The order items are batch published in a single call to the fulfillment system, but the fulfillment system handles each message individually and publishes one at a time to the domain system (no batch publishing). They are all published around the same time, but not in a batch call. The domain system is where I am seeing the largest number of duplicates. If that's the only way to receive more than MaxOutstandingMessages, then in our case there shouldn't be any messages that have been pulled but are waiting to begin processing. Is it possible something else is causing the latency?

This is only anecdotal as I haven't set up a test as thorough as @zhenjl's, but from what I can observe duplicates are more likely when multiple messages are coming in around the same time (although not originating from a batch publish). When a worker only has a single message to process, I don't see any signs of duplicates so far -- even in the same workers.

pongad commented 6 years ago

I made a CL that should help with this.

@zhenjl Thank you for the measurements. This is in line with our expectation. If you ack the message very quickly, the acks will reach the server before the server expires the message, so there'd be fewer duplicates.

@cristiangraz You're right, I'm sorry. The only way to "change" the deadline right now is to delete the subscription and recreate. Hopefully the CL would mean you don't have to do that.
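
For anyone else in the same spot, recreating the subscription with a longer deadline looks roughly like this. It is a sketch against the current CreateSubscription signature (which has changed between releases), and the IDs are placeholders:

topic := client.Topic("my-topic") // placeholder topic ID
sub, err := client.CreateSubscription(ctx, "my-sub-30s", pubsub.SubscriptionConfig{
	Topic:       topic,
	AckDeadline: 30 * time.Second, // instead of the 10s default
})
if err != nil {
	// handle err
}
_ = sub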

The fact that you're seeing this problem in your domain system is indeed suspicious. Could you let us know how you're publishing? If you're publishing by topic.Publish in the pubsub package, the publisher tries to batch messages together to increase performance. Could that be the problem?

I'll continue to work with the pubsub team to fix this.

cristiangraz commented 6 years ago

@pongad Yeah we are using topic.Publish -- I suppose if several independent goroutines are calling Publish the client could be batching them.
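
If publisher-side batching does turn out to matter here, it can be effectively disabled through the topic's PublishSettings. A sketch (the topic ID and thresholds are illustrative only, and this trades throughput for fewer surprises while debugging):

topic := client.Topic("domain-events") // placeholder topic ID
// Flush each message on its own rather than waiting to fill a batch.
topic.PublishSettings.CountThreshold = 1
topic.PublishSettings.DelayThreshold = time.Millisecond

res := topic.Publish(ctx, &pubsub.Message{Data: []byte("payload")})
if _, err := res.Get(ctx); err != nil { // Get blocks until the server acknowledges the publish
	// handle err
}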

pongad commented 6 years ago

We investigated this more. We believe the messages are buffered on a machine between your VM and the pubsub server. This explains the latency: the pubsub server can send messages into the buffer faster than we can read them out, so some messages spend a few seconds in there. We're working to fix this.

In the immediate term, you can try reverting to v0.11.0. That version pulls messages from a slower endpoint that shouldn't be affected by this problem. If you see a compile error like undefined: gax.MustCompilePathTemplate, you need to also revert gax-go to version v1.0.0. Please let me know if this helps.

cristiangraz commented 6 years ago

Thanks @pongad. We rolled out a fix for all of our workers on Friday so that they are unique by (subscription name, message id) using SQL locks, which means we're currently protected from duplicate concurrent messages. I had some issues with v0.11.0, so I am going to stay pinned to the current version until a fix is ready.
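
For others wanting the same guard, a minimal sketch of deduplicating on (subscription name, message ID) -- here with a unique constraint rather than explicit locks. The table, column names, and use of database/sql against a Postgres-style database are all assumptions, not what we actually run:

// Assumes a table like:
//   CREATE TABLE processed_messages (
//     subscription TEXT NOT NULL,
//     message_id   TEXT NOT NULL,
//     PRIMARY KEY (subscription, message_id)
//   );
const markProcessed = `INSERT INTO processed_messages (subscription, message_id)
	VALUES ($1, $2) ON CONFLICT DO NOTHING`

// alreadyProcessed records the delivery and reports whether this
// (subscription, message ID) pair has been seen before.
func alreadyProcessed(ctx context.Context, db *sql.DB, subName, msgID string) (bool, error) {
	res, err := db.ExecContext(ctx, markProcessed, subName, msgID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n == 0, nil // zero rows inserted means this is a duplicate delivery
}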

I appreciate all of your help digging into this, it sounds like you may have found the cause behind the latency. If you need any additional info please let me know.

cmoad commented 6 years ago

+1 on this issue. We've been trying to determine for days why we aren't processing through our queues and reverting to v0.11.0 seems to have resolved the issue. We were showing a ton of Google Cloud Pub/Sub API errors in our cloud console.

ProfessorBeekums commented 6 years ago

I'm also seeing this issue. I had actually changed my ack deadline in the Google Cloud UI to 300 seconds, but noticed my messages being redelivered every 10 seconds. The UI showed me the deadline was 300 seconds though.

Recreating my subscription fixed that problem, but is there a reason that edit option isn't disabled if it isn't supported?

jba commented 6 years ago

@ProfessorBeekums Subscription.Receive gets the ack deadline from the service at the beginning, and then remembers it for the life of the call. You have to call Receive again to get the change.

tecbot commented 6 years ago

Any updates on this issue? We are facing the same issue and this is a real pain.

jba commented 6 years ago

It's an active area of work internally. There should be fixes rolled out in the next couple of weeks.

tmatsuo commented 6 years ago

I'm getting the same problem with this version: https://github.com/GoogleCloudPlatform/google-cloud-go/commit/8c4ed1f54434ff9ea67929c91a4a10db57a52780

My theory of this problem is the following:

This line: https://github.com/GoogleCloudPlatform/google-cloud-go/blob/8dff92c85f4225d90bdd100ea741d9903acc259e/pubsub/iterator.go#L58

It always schedules the keep-alive for 5 seconds before the deadline. Let's say we have a 10-second ack deadline on the subscription.

OK case:

| Time | Client | Server |
| --- | --- | --- |
| 0 | Pull a message | Send a message |
| 5 | Ask for extension | Quickly accept the extension |

Bad case (assuming it retries on failure):

| Time | Client | Server |
| --- | --- | --- |
| 0 | Pull a message | Send a message |
| 5 | Ask for extension | Somehow erroring, hanging almost 5 seconds |
| 10 | Retrying modifyAckDeadline | Got it, but it's too late |

Originally I started to experience this problem with this issue: https://github.com/GoogleCloudPlatform/google-cloud-go/issues/382

As I recall, when I used the legacy client lib, I manually sent modifyAckDeadline myself, with a grace period of 15 seconds and an AckDeadline of 60 seconds. At that time, it worked much, much better.

That makes me think of something like:

keepAlivePeriod := Max(po.ackDeadline - 5*time.Second, po.ackDeadline * 1/4)

for an easy mitigation. WDYT?
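
Spelled out as a standalone helper, that suggestion is roughly the following (a sketch only; the client's actual internals differ and the helper name is made up):

// keepAlivePeriod is the time to wait before extending the ack deadline:
// 5 seconds before expiry, but never less than a quarter of the deadline
// (which guards against very short deadlines).
func keepAlivePeriod(ackDeadline time.Duration) time.Duration {
	p := ackDeadline - 5*time.Second
	if min := ackDeadline / 4; p < min {
		p = min
	}
	return p
}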

pongad commented 6 years ago

@tmatsuo From our analysis, things look slightly different from what you just mentioned.

In your table, the server sends and the client receives at about the same time. Our analysis shows that this isn't really the case. There could be significant buffering between client and server. So things might look kind of like this instead:

| Time | Client | Server |
| --- | --- | --- |
| 0 | Message buffers (the client hasn't "seen" it) | Send a message |
| 6 | Sees message | |
| 10 | | Message expires |
| 11 | Sends modack | Got it, but too late |

Your solution might help to a degree. However, it's always possible for enough messages to back up that changing the keep-alive period won't help.

We're fixing this partly server-side so things look like this:

| Time | Client | Server |
| --- | --- | --- |
| 0 | Message buffers (the client hasn't "seen" it) | Send a message, but pause the "clock" |
| 6 | Sees message; notifies server [1] | Starts the clock; message expires at time=16 |
| 11 | Sends modack | Got it, with time to spare |

The change [1] has already been made to the client here.

Last week, we estimated that the server-side fix should be rolling out this week. I'll keep this thread updated.

tmatsuo commented 6 years ago

@pongad Alright, it sounds promising :) Thanks!

tmatsuo commented 6 years ago

@pongad

FWIW, although your scenario looks promising, I think my suggestion or a similar approach will also be needed, because I often observe duplicated messages some time after the pull request was made.

pongad commented 6 years ago

duplicated messages some time after the pull request was made

I'm not 100% sure I parse this right. Are you saying the duplicates arrive a certain time after you ack them?

tmatsuo commented 6 years ago

@pongad

No. I mean that the dups seem to be happening for me not only on the first iteration of modack, but also on one of the subsequent iterations. I'm not 100% confident though.

tmatsuo commented 6 years ago

@pongad I think some of the modack failures in our case might have been because of network congestion (and the short grace period, of course). Sorry for the noise.

pongad commented 6 years ago

The server-side fix has landed. Could you try again and let us know if you're seeing fewer dups?

There is an untagged change on master that takes full advantage of the server feature, but we believe the fix should also work on the latest tagged version.

jfbus commented 6 years ago

Still having the same number of duplicates here (europe-west-1)...

jba commented 6 years ago

What commit are you running the Go client at?

jfbus commented 6 years ago

Still using v0.12.0

pongad commented 6 years ago

Thank you for letting us know @jfbus . I have reported this to the pubsub team.

dengliu commented 6 years ago

@pongad
Is it expected to be fixed in v0.16.0? I have tried the v0.16.0 Go client but still see duplicated messages. What I am doing is sleep 10 seconds and then ACK. The "Acknowledgment Deadline" was set to 300 seconds.

pongad commented 6 years ago

@dengliu I believe the fix hasn't been tagged yet. Would it be possible for you to try from master?

roger6106 commented 6 years ago

It looks to me like the change was added in v0.16.0 with https://github.com/GoogleCloudPlatform/google-cloud-go/commit/c9fc9dd1cdea3f1cfd34d266f624232ddf462083.

I am still having this issue on v0.17.0. In my case I suspect that my upload speed is getting saturated. I think some clarification on connection delays would be beneficial on this article: https://cloud.google.com/pubsub/docs/faq#duplicates

dengliu commented 6 years ago

I tested the go client with v0.16.0 this week , but it didn't seem to be fixed.

pongad commented 6 years ago

@roger6106 @dengliu The pubsub team just rolled out more fixes that should further reduce the duplicates. The fix landed only a couple of days ago. Could you let us know if you're still seeing this problem?

If you are, could you let us know how long you're taking to process each message? It'd help us reproduce the problem.

dengliu commented 6 years ago

@pongad what did you roll out? Did anything change on the server side? Since 3 days ago we have been seeing issues where the pull rate sporadically becomes much lower than normal.

pongad commented 6 years ago

@dengliu I checked with the pubsub team; this is indeed a problem on the server. They are working to resolve this.

marklr commented 6 years ago

We're also seeing strange pull/push ratios: ~3k msgs/sec published to a topic, with a single sub (single & several consumers) yielding ~200 msgs/sec consumed.

pongad commented 6 years ago

@marklr I'm not completely sure I follow. How quickly do you process the messages? Receive limits the number of concurrently processing messages to make sure your computer doesn't run out of resources. So if you take a while to process messages, it will pull slowly.

marklr commented 6 years ago

Hey @pongad,

I've experimented with multiple consumers, both within GCP's network and outside it, with multiple variations of the consumer code (Python and Go).

I tried truncating the handler to simply ack the message and still end up with a consumption rate of 10% of the publishing rate.

I also experimented with the Cloud Dataflow pubsub to bigquery template and its consumption rate was similar.

I have a single sub, with a 10m deadline and consumers with a minimal handler - checking the publishTime field vs now yields 0 so there's no delivery delay... And yet I'm seeing less than 100 messages per second vs 3k being published.

pongad commented 6 years ago

@marklr Can you see if the suggestions in this thread help you? If you have further questions, let's discuss on the linked thread instead. Your concern seems to be around performance, not message duplication.

anorth2 commented 6 years ago

Exact same issue as marklr. There is something seriously messed up with the current clients. The Go client is at least reliable compared to the Python client, but right now (with 10 pubsub consumer replicas) in GKE we get spikes to 700ish pull operations per second (though looking at the graph it has stabilized to 70/s), spikes of 70 acks/s (stabilized to 1/s), and around 10,000 modify ack deadline operations per second (constant, no correlation to consumption rate). We see the same behavior with an ack deadline set to 10 seconds and to 100 seconds.

What the hell. This has been an issue since September. Why hasn't it been fixed?

jba commented 6 years ago

@anorth2, the issue @marklr is having has to do with low message processing rates in Subscription.Receive. Is that what you're experiencing? If so, what are your ReceiveSettings?

anorth2 commented 6 years ago

@jba sub.ReceiveSettings.MaxOutstandingMessages = 25

jba commented 6 years ago

As mentioned in #824, you can turn on the firehose by setting both MaxOutstanding fields to -1.
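
In code that is roughly (field names per the pubsub package's ReceiveSettings):

sub.ReceiveSettings.MaxOutstandingMessages = -1 // no limit on messages held in memory
sub.ReceiveSettings.MaxOutstandingBytes = -1    // no limit on outstanding bytes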

I'm curious about the stats you quoted. How did you measure the number of modify ack deadline operations, for example? When you use Receive, there shouldn't be any ModifyAckDeadline RPCs. Nor should there be Acknowledge or Pull RPCs. Those are all replaced by the StreamingPull call.

What version of the Go client are you using?

anorth2 commented 6 years ago

@jba apologies, they are all StreamingPull operations, measured via Stackdriver. Why would setting MaxOutstanding to unlimited increase the ack rate when the message pull rate is already close to 1000/s?

pongad commented 6 years ago

@jba, if you go to "Monitoring" in cloud console then resources -> pub/sub, you can pull up metrics about your subscription. It separates acks from the rest.

@anorth2 Are there any details you can provide? How long do you need to process a message? Does the issue occur right at startup or after 10 hours? etc. We have a load test we run periodically and we are able to consistently pull and ack ~300MB worth of messages every second. Perhaps it's a bug that doesn't act up in our test cases.

I (or Jean) can investigate this.

anorth2 commented 6 years ago

@pongad We changed our consumer to ack immediately (to remove any possibility of a timeout). It behaves this way immediately. Notably, this behavior only exists in kubernetes. Running locally (inside a docker container) it executes perfectly.

jfbus commented 6 years ago

I don't know if it's relevant, but our consumers (affected by the duplicates issue) also run on kubernetes (GKE).