googleapis / nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
https://cloud.google.com/pubsub/
Apache License 2.0

Sometimes publishes tens of thousands of duplicate copies of messages #1903

Closed mgabeler-lee-6rs closed 2 months ago

mgabeler-lee-6rs commented 6 months ago

My team has started observing intermittent (but frequent) instances where an individual message publish call we make to this library results in tens of thousands of duplicate copies being sent to the PubSub servers.

This is symptomatically similar to old issue #847, but it is happening to us even at very low publish rates (single digit messages per second).

In one particular incident that we were able to map through our trace logs, we could confirm that a single call to the publish method led to over 15,000 copies being sent. The client library eventually reported a send timeout in that case, though other cases do not always end in a send timeout. We're confident that these are copies being sent by the library, not duplicate deliveries of some sort, for two reasons: we captured sufficient logging to confirm that the duplicates all have different PubSub message IDs despite having identical payloads, and we can observe the network traffic from our applications (running in GKE) spiking to 10x or more its normal level in perfect correlation with these events, and with corresponding spikes in the Stackdriver publish requests/operations counters.

We have been attempting to make a small reproducer application, but have been unsuccessful so far, despite this happening several times an hour across our production deployments. We have also been unsuccessful at reproducing this in staging deployments.

There is some soft evidence that network or PubSub server conditions have an influence on this: we have many independent GKE deployments (same cluster, same app images, but different customers, with no shared data between them), and there is a significant correlation across these deployments in when this happens -- we see windows of time when it is impacting many deployments, and then periods when it is impacting almost none of them.

If it is relevant, our applications make heavy use of ordered publishing. They almost never send more than one message at a time per ordering key, but may have low tens of ordering keys in flight at the same time. Occasionally we may have a hundred or so ordering keys in flight, but that is rare, and we see this happening regularly with single-digit numbers of keys concurrently in flight.

There is a high, but not perfect, correlation between publish requests experiencing this duplication and those requests eventually failing with a "Total timeout of API google.pubsub.v1.Publisher exceeded 120000 milliseconds before any response was received" error.

Tracing through older versions that are still deployed, we can find no instances of this happening when we were using version 3.0.1. We don't have much data for anything between that and v4.0.7. We ~are trying out the latest v4.3.3 now, but nothing jumps out in the changelog as being likely related to this issue so far~ have successfully reproduced this issue with v4.3.3 too.

We have some limited gRPC trace logs captured from these incidents, but they didn't capture any reasons for the duplicate sends, only more confirmation that it is coming from the client library. The trace logs were captured with GRPC_VERBOSITY=DEBUG and GRPC_TRACE=channel,subchannel,call_stream, and show a thundering horde of createRetryingCall, createLoadBalancingCall, createResolvingCall trios to method="/google.pubsub.v1.Publisher/Publish", but nothing else for the publishes. We can update the GRPC_TRACE setting if there are suggestions for additional elements that will help diagnose this without slowing down the app too much with excessive logging.

While most of our apps both publish & subscribe (streaming pull), this has been confirmed in at least one app that only publishes -- the "15k" example above happened to come from that app.

We have support case 50200427 open for this, but are filing an issue here because it seems very likely at this point that this is a bug somewhere in the pubsub + gax + grpc stack.

Environment details

Steps to reproduce

  1. Publish a few messages per second to a handful of ordering keys at a time (a minimal sketch of this publish pattern follows below)
  2. Watch network traffic, timeouts, and the pubsub.googleapis.com/topic/send_message_operation_count metric to see whether hundreds of messages per second are suddenly being published instead of fewer than ten
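
For reference, a minimal sketch of the publish pattern from step 1, assuming the standard @google-cloud/pubsub ordered-publishing API; the topic name, key count, and send interval are illustrative placeholders, not values from the affected apps:

// Minimal sketch of the repro scenario: a few messages per second spread
// across a handful of ordering keys. All names and rates are examples.
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const topic = pubsub.topic('my-topic', {messageOrdering: true});

const orderingKeys = ['key-0', 'key-1', 'key-2', 'key-3', 'key-4'];
let i = 0;

setInterval(async () => {
  const orderingKey = orderingKeys[i++ % orderingKeys.length];
  try {
    await topic.publishMessage({
      data: Buffer.from(JSON.stringify({seq: i, sentAt: Date.now()})),
      orderingKey,
    });
  } catch (err) {
    console.error('publish failed', err);
    // With ordering enabled, publishing for this key stays paused after an
    // error until it is explicitly resumed.
    topic.resumePublishing(orderingKey);
  }
}, 200); // roughly 5 messages/second in total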


mgabeler-lee-6rs commented 6 months ago

After much gnashing of teeth, we finally found some diagnostic data for this, while using the latest v4.3.3 release.

We disabled all publish retries within this library and its dependencies by way of gaxOpts in the publisher options:

{
    gaxOpts: {
        maxRetries: 0,
        retry: {
            retryCodes: [
                // this has to be non-empty to be obeyed, but we don't want it to
                // match on anything
                NaN,
            ],
        },
    }
}
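
For context, a minimal sketch of roughly how options like these get attached when obtaining the topic handle; "my-topic" and the log messages are placeholders, not our real code:

// Sketch: attaching the retry-disabling gaxOpts above to a topic's
// publisher options. The topic name is an illustrative placeholder.
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const topic = pubsub.topic('my-topic', {
  gaxOpts: {
    maxRetries: 0,
    retry: {
      // must be non-empty to be obeyed, but should never match a real status code
      retryCodes: [NaN],
    },
  },
});

topic
  .publishMessage({data: Buffer.from('payload')})
  .then(messageId => console.log('published', messageId))
  .catch(err => console.error('publish failed (no client-side retries)', err));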

This allowed us to see what the error was that (presumably) triggered this retry flood. What we got back was:

message: 14 UNAVAILABLE: The request deadline has expired before the operation could complete
stack:
Error: 14 UNAVAILABLE: The request deadline has expired before the operation could complete
    at callErrorFromStatus (/usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/call.ts:82:17)
    at Object.onReceiveStatus (/usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/client.ts:360:55)
    at Object.onReceiveStatus (/usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/client-interceptors.ts:458:34)
    at Object.onReceiveStatus (/usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/client-interceptors.ts:419:48)
    at /usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/resolving-call.ts:132:24
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
for call at
    at ServiceClientImpl.makeUnaryRequest (/usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/client.ts:325:42)
    at ServiceClientImpl.<anonymous> (/usr/src/app/common/temp/node_modules/.pnpm/@grpc+grpc-js@1.10.3/node_modules/@grpc/grpc-js/src/make-client.ts:189:15)
    at /usr/src/app/common/temp/node_modules/.pnpm/@opentelemetry+instrumentation-grpc@0.45.1_@opentelemetry+api@1.7.0/node_modules/@opentelemetry/instrumentation-grpc/src/grpc-js/clientUtils.ts:177:27
    at /usr/src/app/common/temp/node_modules/.pnpm/@opentelemetry+instrumentation-grpc@0.45.1_@opentelemetry+api@1.7.0/node_modules/@opentelemetry/instrumentation-grpc/src/grpc-js/index.ts:488:12
    at AsyncLocalStorage.run (node:async_hooks:330:14)
    at AsyncLocalStorageContextManager.with (/usr/src/app/common/temp/node_modules/.pnpm/@opentelemetry+context-async-hooks@1.18.1_@opentelemetry+api@1.7.0/node_modules/@opentelemetry/context-async-hooks/src/AsyncLocalStorageContextManager.ts:40:36)
    at ContextAPI.with (/usr/src/app/common/temp/node_modules/.pnpm/@opentelemetry+api@1.7.0/node_modules/@opentelemetry/api/src/api/context.ts:77:42)
    at ServiceClientImpl.clientMethodTrace [as publish] (/usr/src/app/common/temp/node_modules/.pnpm/@opentelemetry+instrumentation-grpc@0.45.1_@opentelemetry+api@1.7.0/node_modules/@opentelemetry/instrumentation-grpc/src/grpc-js/index.ts:481:28)
    at /usr/src/app/common/temp/node_modules/.pnpm/@google-cloud+pubsub@4.3.3/node_modules/@google-cloud/pubsub/src/v1/publisher_client.ts:322:25
    at /usr/src/app/common/temp/node_modules/.pnpm/google-gax@4.3.1/node_modules/google-gax/build/src/normalCalls/timeout.js:44:16
    at repeat (/usr/src/app/common/temp/node_modules/.pnpm/google-gax@4.3.1/node_modules/google-gax/build/src/normalCalls/retries.js:80:25)
    at /usr/src/app/common/temp/node_modules/.pnpm/google-gax@4.3.1/node_modules/google-gax/build/src/normalCalls/retries.js:119:13
    at OngoingCall.call (/usr/src/app/common/temp/node_modules/.pnpm/google-gax@4.3.1/node_modules/google-gax/build/src/call.js:67:27)
    at NormalApiCaller.call (/usr/src/app/common/temp/node_modules/.pnpm/google-gax@4.3.1/node_modules/google-gax/build/src/normalCalls/normalApiCaller.js:34:19)
    at /usr/src/app/common/temp/node_modules/.pnpm/google-gax@4.3.1/node_modules/google-gax/build/src/createApiCall.js:108:30
    at runMicrotasks (<anonymous>)

The combination of code 14 UNAVAILABLE with the message text for deadline exceeded is strange, at least to me.

Further, we can see in the logs that the messages that triggered this error were in fact sent: our error logging recorded a unique ID from the payload, and we can see that same payload being processed by the subscribers about 10 seconds (+/- 100 ms) before this error was thrown on the publisher side.

I infer that, with retries enabled, the library would have hit this same or some similar error at high frequency, causing the over-publishing of a single message.

With the retries disabled within PubSub/gax/grpc, we are not seeing the "over-publishing" any more (thankfully our apps have their own retry logic, which is sufficient for these cases), but that is not a desirable situation long term.

To try to pin down more details on these, we have been running with limited GRPC trace logging enabled. We updated that to be:

GRPC_TRACE=channel,subchannel,call_stream,load_balancing_call
GRPC_VERBOSITY=DEBUG

With this we were able to capture another instance, and extracted these lines as the ones that appeared related to the failing call:

D 2024-03-26T12:45:02.710Z | v1.10.3 7 | channel | (2) dns:us-central1-pubsub.googleapis.com:443 createLoadBalancingCall [2898] method="/google.pubsub.v1.Publisher/Publish"
D 2024-03-26T12:45:02.710Z | v1.10.3 7 | load_balancing_call | [2898] start called
D 2024-03-26T12:45:02.710Z | v1.10.3 7 | load_balancing_call | [2898] Pick called
D 2024-03-26T12:45:02.710Z | v1.10.3 7 | load_balancing_call | [2898] Pick result: COMPLETE subchannel: (21) 108.177.111.95:443 status: undefined undefined
D 2024-03-26T12:45:02.710Z | v1.10.3 7 | load_balancing_call | [2898] startRead called
D 2024-03-26T12:45:02.710Z | v1.10.3 7 | load_balancing_call | [2898] Created child call [2899]
D 2024-03-26T12:45:02.710Z | v1.10.3 7 | load_balancing_call | [2898] write() called with message of length 800
D 2024-03-26T12:45:02.716Z | v1.10.3 7 | load_balancing_call | [2898] halfClose called
D 2024-03-26T12:45:12.728Z | v1.10.3 7 | load_balancing_call | [2898] Received metadata
D 2024-03-26T12:45:12.729Z | v1.10.3 7 | load_balancing_call | [2898] Received status
D 2024-03-26T12:45:12.729Z | v1.10.3 7 | load_balancing_call | [2898] ended with status: code=14 details="The request deadline has expired before the operation could complete"

While I don't think there's any way for the client code to magically know that these messages actually were published server-side, something strange still seems to be happening with the retries: with them enabled, the retries appeared to go into a tight loop, as if the backoff settings were being ignored, retrying instantly, re-failing instantly, and so on.

leahecole commented 6 months ago

Hi - I know that folks have been in touch via internal channels and that work on this is ongoing! Dropping in the comments here to link https://github.com/firebase/firebase-admin-node/issues/2495 and https://github.com/googleapis/gax-nodejs/pull/1576 as potentially related

mgabeler-lee-6rs commented 6 months ago

Thank you @leahecole. While the gax change in particular seems related to retries, it's not clear to me how it could cause the "retry as fast as possible without any delay between attempts" behavior seen here.

leahecole commented 6 months ago

It is not clear to me either whether it's correlated or not - that's what I'm trying to figure out 🕵🏻‍♀️ Stay tuned for more!

The gax change was mostly related to server streaming retries but did touch a couple of conditionals related to unary retries. I'm working with a teammate to validate if those changes are at all at play here, or if it's an issue in a lower part of the stack.

leahecole commented 6 months ago

My non-update update is that the more I dig, the less I think the gax changes are the root cause here, though they may contribute a side effect (still not sure). If I've read through the notes right, in the state where this issue occurs, retryCodes is set to the empty array. When that's the case, gax does not reach the retry logic that would be used with publish and that would repeat the call; this conditional prevents it from doing so, which is good. In the example you shared above, when we add something to the retryCodes array, we do correctly end up in the retry code.

I tried reproducing the error conditions locally (not in GKE) and could not - I never saw that spike in send_message_operation_count. I am wondering if something about that environment is creating the condition in which this happens, especially given that the firestore issue I posted above also sees DEADLINE_EXCEEDED when using GKE. My hope is that the gax PR that updates grpc makes this issue obsolete. I will keep investigating to try to rule out various factors.

Sidebar - If it helps, I have used this library in the past for the cases where the timeout doesn't seem to get sent.

vbshnsk commented 5 months ago

Thanks for a great write-up. We have experienced the same thing in our production services as well: when there is a high rate of incoming messages, some tend to get resent over and over.

feywind commented 5 months ago

I'm starting to think that this issue is related: https://github.com/googleapis/nodejs-pubsub/issues/1213

it still seems like something strange is happening with the retries, because with them enabled we were seeing the retries appearing to go into a tight loop, as if the backoff settings were being ignored and retrying instantly, re-failing instantly, etc.

There is a logic error there in the leaser that will end up auto-dispensing messages that have already expired when the first one in the queue expires.

mgabeler-lee-6rs commented 5 months ago

@feywind That may be the source of some problems, but in my team's specific case described here, I don't think any subscription-side issues were the source of the problem, because:

  1. CPU usage and outbound network traffic from the GKE pod doing the publishing spike every time the duplication issue occurs, despite our application-level logging & metrics showing that the attempted publish rate was still low (< 10 msgs/sec, often < 1 msg/sec).
    • This aspect may bear emphasis: many other reports of similar issues seemed to come with large publish bursts, but ours happens with very low publish rates by PubSub standards, and seems more related to specific error patterns, or perhaps some subtle timing issue deep in the grpc/gax code, that can be triggered by just one or a couple of publish attempts.
  2. We were able to retrieve copies of the duplicate messages in several cases, and they showed a consistent pattern of identical payloads but different publish IDs.
  3. We would see those duplicated payloads even in cases where we could confirm from logging that the publish call threw an error.
  4. We were able to stop the duplicates issue by turning off all retries in the publisher configuration (see the first comment here; our apps have higher-level retries that take care of failures), and doing so produced an error rate coming into that app-level retry logic consistent with the frequency of the duplicate-publishing incidents seen when the PubSub client retries had been enabled.

Altogether, this tells me that the issue described here is purely a publisher-side issue, and indeed could probably happen even if no subscription/subscriber existed.

Unless of course I'm not understanding the details in the other ticket, and the lease code mentioned is also involved in publishing somehow?

feywind commented 5 months ago

@mgabeler-lee-6rs Yes, you're right - I think I misread the comment I quoted, apologies.

feywind commented 2 months ago

@mgabeler-lee-6rs Quick update: I think we might've found a potential new cause, especially since you're setting retry settings. The library is taking in retry settings and replacing the existing ones rather than merging them. This means that if you don't include, e.g., the retry multipliers in your settings, the delays and timeouts end up becoming zero, so it just starts spamming the server with retries. Some others have seen this cause a huge flood of duplicate published messages.

We have a tracking bug to update the retry settings sample, and the docs, to cover this case, but if this is still a problem for you, you might try filling out the retry settings object fully, e.g.:

  const retrySettings = {
    retryCodes: [
      10, // 'ABORTED'
      1, // 'CANCELLED',
      4, // 'DEADLINE_EXCEEDED'
      13, // 'INTERNAL'
      8, // 'RESOURCE_EXHAUSTED'
      14, // 'UNAVAILABLE'
      2, // 'UNKNOWN'
    ],
    backoffSettings: {
      initialRetryDelayMillis: 100,
      retryDelayMultiplier: 1.3,
      maxRetryDelayMillis: 60000,
      initialRpcTimeoutMillis: 60000,
      rpcTimeoutMultiplier: 1.0,
      maxRpcTimeoutMillis: 60000,
      totalTimeoutMillis: 60000,
    },
  };
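
(A sketch of how a fully-populated retry object like this could be passed through the publisher options; "my-topic" and the surrounding wiring are illustrative placeholders, consistent with the gaxOpts.retry shape used earlier in this thread:)

  // Sketch: passing the fully-populated retry settings above through the
  // publisher's gaxOpts so the delays and timeouts are never left at zero.
  const topic = pubsub.topic('my-topic', {
    gaxOpts: {
      retry: retrySettings,
    },
  });
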
mgabeler-lee-6rs commented 2 months ago

Yes, that could explain our situation; previously we were passing:

{
  gaxOpts: {
    retry: {
      backoffSettings: {
        totalTimeoutMillis: 2 * 60_000,
      },
    },
  },
}

Our workaround for this was to fully disable retries in the Google libraries with gaxOpts.maxRetries: 0 and gaxOpts.retry.retryCodes: [NaN].

That said, I hadn't noticed delay/timeout parameters being zero when I was single-stepping through the gax code, so there may be a little more to this; I'm not sure.

Either way, since we have to have robust retry logic at our application layer (no level of retries in this library can guarantee success, esp. when k8s may kill -9 the container :grin:), and knowing about failures sooner is much preferable in our use case, disabling retries entirely in this library has been an effective workaround/solution for us.
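
(To illustrate the kind of application-level retry mentioned above -- a generic sketch, not our actual code; the attempt count and delays are arbitrary examples:)

// Generic application-level retry around publishMessage, assuming the
// client-library retries have been disabled as described earlier.
async function publishWithRetry(topic, message, attempts = 5) {
  let delayMs = 500;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await topic.publishMessage(message);
    } catch (err) {
      if (attempt === attempts) throw err;
      if (message.orderingKey) {
        // Ordered publishing pauses the key after a failure; resume it before retrying.
        topic.resumePublishing(message.orderingKey);
      }
      await new Promise(resolve => setTimeout(resolve, delayMs));
      delayMs *= 2; // simple exponential backoff
    }
  }
}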

feywind commented 2 months ago

Node google-gax internally is ... messy. 😹 I hope it can get cleaned up more, but in the meantime, I'm glad you found a good way forward.

I'm going to go ahead and close this, and if anyone sees similar problems pop up, please comment.