googleapis / nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
https://cloud.google.com/pubsub/
Apache License 2.0

Subscription streaming pull requests are disconnecting after 15 minutes #1135

Closed: feywind closed this issue 1 year ago

feywind commented 4 years ago

This started out of this issue: https://github.com/googleapis/nodejs-pubsub/issues/979

I just reverted the change that exposed the stream cancellations, but they are still happening under the covers. I think it would behoove us to figure out why. The commonality that I've seen so far is that it generally involves crossing a Docker networking boundary. A working reproduction is available on that issue above, though it only seems to happen on grpc-js, so it might also have something to do with the Node HTTP/2 stack.

Several question marks here, and I just don't want to lose track of that investigation, even if we fix the user breakage for now.

dpkirchner commented 3 years ago

Is this issue intended to track or resolve the reconnect issues from #979 and #770 in addition to the disconnects?

jeffijoe commented 3 years ago

We are still seeing this issue, with the addition of the following being logged to console:

Connection to [object Object] rejected by server because of excess pings
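
A commonly suggested mitigation for "excess pings" rejections is to make the client's keepalive pings less frequent. A rough sketch, assuming that channel options prefixed with "grpc." are forwarded to the underlying channel when passed to the PubSub constructor (via google-gax); the values are placeholders and worth verifying against the gax/grpc-js versions in use:

const {PubSub} = require('@google-cloud/pubsub');

// Assumption: options starting with "grpc." reach the underlying gRPC channel.
// Sending keepalive pings less often should avoid the server's
// "too many pings" GOAWAY that surfaces as the message above.
const pubsub = new PubSub({
  'grpc.keepalive_time_ms': 300000,   // ping at most every 5 minutes (placeholder value)
  'grpc.keepalive_timeout_ms': 20000, // wait up to 20s for a ping ack (placeholder value)
});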

mgabeler-lee-6rs commented 3 years ago

After updating our application to the latest version of this package and @grpc/grpc-js, we are now experiencing issues that sound like this at showstopper levels.

We have tried switching some of our component services to use the native grpc, but can reproduce the issue there too.

We see no errors or warnings or anything else logged from the Google code when the app stops receiving messages.

Are there any other suggestions for other workarounds?

feywind commented 3 years ago

Yikes... so the new behaviour is that it just sort of silently stops receiving messages?

Do you happen to have grpc debug output? You can get that by setting the environment variables:

export GRPC_VERBOSITY=DEBUG
export GRPC_TRACE=all
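
If changing the deployment environment is awkward, the same variables can be set from Node itself, as long as that happens before the Pub/Sub client (and therefore @grpc/grpc-js) is first loaded. A minimal sketch, assuming the gRPC module reads these variables at load time; the subscription name is hypothetical:

// Set the debug variables before requiring the client, since gRPC appears
// to read them when its module is first loaded.
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'all';

const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const subscription = pubsub.subscription('my-subscription'); // hypothetical name

subscription.on('message', message => {
  console.log(`received ${message.id}`);
  message.ack();
});
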
mgabeler-lee-6rs commented 3 years ago

so the new behaviour is that it just sort of silently stops receiving messages?

Correct. And this seems to happen after a period where no messages were sent. It doesn't seem to happen (or at least not happen often) in periods where messages are being sent & received regularly.

Do you happen to have grpc debug output?

Working on collecting this, yes. Hoping to do so with less GRPC_TRACE to cut down on the logging volume, and the amount of private data that may need to be elided before sharing.

feywind commented 3 years ago

And one more question, does your use case involve Docker? (GKE or whatnot...) I've found that that seems to be a commonality for this problem.

jeffijoe commented 3 years ago

I think this is what we have been experiencing for the past few years, but it went from Call canceled to silent connection drop.

@feywind this issue is happening with the real PubSub service as well, same conditions. It appears that a quiet connection is closed. It does not happen in production for us (at least not yet) because it stays busy.

mgabeler-lee-6rs commented 3 years ago

And one more question, does your use case involve Docker? (GKE or whatnot...)

Yes, we are running on GKE

We've seen things similar to this using the simulator on MacOS developer systems (so apps running on MacOS, simulator running in Docker-for-Mac), but it was very hard to collect any data or symptoms from. I'm not at all sure that actually was this issue vs. Docker-for-Mac acting up, as DfM has been unbelievably unreliable for us lately.

However, seeing this in production for us correlates closely with when we updated these Google-related dependencies:

I guess I can't rule out that mismatched versions of these libraries are creating confusion, but I've found it's nigh impossible to keep these consistent.

feywind commented 3 years ago

My vague recollection of debugging one of the linked issues is that I only saw this happening (and it was really reproducible) when there was a Docker network boundary involved. So emulator in a container, client on GKE, or something like that. I worked with the grpc team to try to figure out why, and I don't think we ever found anything super useful in the grpc traces either. :( We thought it was at least worked around by way of papering over the reconnects, but it sounds like that's not happening either.

So I see two issues:

1) The client library behaviour is not great for this error if it just sort of silently stops working. :|

For this one, I figured that letting it retry when disconnected would at least roll us back to where we were. It seems like maybe that's not working though...

2) HTTP/2 connections mysteriously go away after a while, especially on Docker

This one I'm less sure about, but I can get my repro case going again and bug the grpc team to see if they can find anything else here. I still suspect weirdness between the Node HTTP/2 stack and Docker iptables rules used to do container isolation, but that's admittedly just a hunch so far.

mgabeler-lee-6rs commented 3 years ago

I've got GRPC_TRACE runs going, and while they have not (yet) reproduced this issue, I do see in the traces what I'm pretty sure is a bunch of StreamingPull requests ending with Deadline Exceeded, and then new replacements for them being started. I'm pretty sure this is for a service that is validly idle -- no messages are being published right now, so there's nothing for it to receive.

Dunno if that might provide any pointers in the code as to what might be going on. If I do manage to reproduce the issue with trace logging enabled here, I'll work through stripping them of sensitive data and share.

feywind commented 3 years ago

Thanks! We can also work out a non-GitHub-public-issue way to get the logs over, if that helps.

mgabeler-lee-6rs commented 3 years ago

That may help. The other thing I've gone looking for, but been unable to find, is how to do the equivalent of GRPC_TRACE with grpc-js, if that's possible. The more services we can run with tracing, the better the odds we've got of collecting a reproducing trace, and updating all our (micro)services to use the grpc module is a lot of work, certainly far more than adding some environment variables to our k8s deployments. While we did reproduce what looked like this issue once so far with that module, it's not clear if that switch-over impacts the repro rate.

feywind commented 3 years ago

@murgatroid99 might know the GRPC_TRACE answer.

murgatroid99 commented 3 years ago

grpc-js uses the same environment variables, and GRPC_TRACE=all will still trace everything.

murgatroid99 commented 3 years ago

If you want to narrow the trace output, the output format from grpc-js is <timestamp> | <tracer name> | <log text>. The value of the GRPC_TRACE environment variable can also be a comma-separated list of tracer names, and it will then log only those lines.

mgabeler-lee-6rs commented 3 years ago

Setting GRPC_VERBOSITY=debug and GRPC_TRACE=all for my services using grpc-js produces no grpc debug output whatsoever.

mgabeler-lee-6rs commented 3 years ago

Oh, grpc-js is case-sensitive as to the value of GRPC_VERBOSITY :man_facepalming: it requires DEBUG, while the native grpc recommends lowercase debug.

Edit: OK, the GRPC docs at https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md#grpc_verbosity are inconsistent. One section says the supported values are the upper-case ones, but then the examples use lowercase

jeffijoe commented 3 years ago

Grasping for straws here but I wonder if this is related: https://www.hostedgraphite.com/blog/deadlines-lies-and-videotape-the-tale-of-a-grpc-bug

mgabeler-lee-6rs commented 3 years ago

Now that we've figured out how to get the trace data working for grpc-js, my colleagues have managed to task a group of testing environments on our end with reproducing this with the trace logging enabled. This group, as a whole, has had a near-100% rate over the last week of reproducing this error somewhere within the group of applications & environments being tested (which specific apps & environments hit it in any given run is highly variable). It will take a few hours for the test sequence to run, and we're hitting the end of folks' days here (so we may be looking at tomorrow for having the logs), but we should now be able to get repros with grpc tracing ... as long as the tracing itself doesn't change some timing behavior and suppress the bug :)

FWIW, we're using GRPC_TRACE=channel,connectivity_state,http_keepalive,flowctl,subchannel in the hopes of not needing the full super-verbose trace output

@feywind any advice on how to share these logs privately (assuming we can collect them) will be appreciated.

mgabeler-lee-6rs commented 3 years ago

We hit a repro with one of the services running the native grpc module. Sadly, due to a log rotation issue, I lost the time window where the problem started, but I did capture a window where it spontaneously recovered after over 10 minutes of not receiving messages (this app logs message age when it receives surprisingly old ones, and this data aligns well with the unacknowledged messages metric).

One thing I notice applying some dumb pattern matching to this log: in much of the log it seems normal to see logging about Deadline Exceeded followed by logging from it starting a new StreamingPull. However, just before the "spontaneous recovery", there's a bunch of logging about starting new StreamingPulls, but no logging about Deadline Exceeded from prior connections.

FWIW, this app has many subscribers to different subscriptions, and at least some of them (possibly all others) were still working fine during the period where this one subscriber stopped receiving messages.

mgabeler-lee-6rs commented 3 years ago

I've got another trace, this time from grpc-js, covering both loss & spontaneous recovery. This one shows it getting a GOAWAY around the time the backlog starts (though curiously a couple minutes after the first messages it should have received, but didn't, were published).

2021-04-27T23:07:27.693Z | subchannel | 209.85.200.95:443 connection closed by GOAWAY with code 0
2021-04-27T23:07:27.693Z | subchannel | 209.85.200.95:443 READY -> IDLE
2021-04-27T23:07:27.694Z | connectivity_state | dns:pubsub.googleapis.com:443 READY -> IDLE
2021-04-27T23:07:27.737Z | subchannel | 108.177.111.95:443 connection closed by GOAWAY with code 0
2021-04-27T23:07:27.738Z | subchannel | 108.177.111.95:443 READY -> IDLE
2021-04-27T23:07:27.738Z | connectivity_state | dns:pubsub.googleapis.com:443 READY -> IDLE
2021-04-27T23:07:33.123Z | subchannel | 209.85.200.95:443 IDLE -> TRANSIENT_FAILURE
2021-04-27T23:07:33.124Z | subchannel | 108.177.111.95:443 IDLE -> TRANSIENT_FAILURE
2021-04-27T23:07:33.124Z | subchannel | 209.85.200.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:33.124Z | subchannel | 108.177.111.95:443 TRANSIENT_FAILURE -> IDLE

Then it goes into a weird-looking loop of this:

2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | connectivity_state | dns:pubsub.googleapis.com:443 IDLE -> CONNECTING
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.841Z | connectivity_state | dns:pubsub.googleapis.com:443 CONNECTING -> CONNECTING
2021-04-27T23:07:48.880Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.880Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.880Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.880Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.881Z | connectivity_state | dns:pubsub.googleapis.com:443 CONNECTING -> CONNECTING
2021-04-27T23:07:48.881Z | subchannel | 2607:f8b0:4001:c2b::5f:443 IDLE -> CONNECTING

This continues for a bit (well, a bunch of log lines, not a lot of time) until it finally seems to give up; it looks like it was trying IPv6:

2021-04-27T23:07:48.885Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.885Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.885Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.885Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-04-27T23:07:48.885Z | connectivity_state | dns:pubsub.googleapis.com:443 CONNECTING -> CONNECTING
2021-04-27T23:07:48.908Z | subchannel | 2607:f8b0:4001:c2b::5f:443 connection closed with error connect EADDRNOTAVAIL 2607:f8b0:4001:c2b::5f:443 - Local (:::0)

This repeats for a while, until it seems to give up, e.g.:

2021-04-27T23:07:53.132Z | subchannel | 173.194.198.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 173.194.197.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 173.194.196.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 173.194.195.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 172.217.212.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 74.125.124.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 142.250.152.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 108.177.112.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 209.85.200.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 142.250.148.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 142.250.136.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 142.250.125.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:07:53.132Z | subchannel | 209.85.146.95:443 TRANSIENT_FAILURE -> IDLE

(note: that's maybe a fifth of the lines like that it logs over the course of a few milliseconds!)

Several minutes later there's another GOAWAY:

2021-04-27T23:15:46.485Z | subchannel | 74.125.70.95:443 connection closed by GOAWAY with code 0
2021-04-27T23:15:46.485Z | subchannel | 74.125.70.95:443 READY -> IDLE
2021-04-27T23:15:46.487Z | connectivity_state | dns:pubsub.googleapis.com:443 READY -> IDLE
2021-04-27T23:15:46.577Z | subchannel | 172.217.219.95:443 connection closed by GOAWAY with code 0
2021-04-27T23:15:46.577Z | subchannel | 172.217.219.95:443 READY -> IDLE
2021-04-27T23:15:46.578Z | connectivity_state | dns:pubsub.googleapis.com:443 READY -> IDLE
2021-04-27T23:15:53.133Z | subchannel | 74.125.70.95:443 IDLE -> TRANSIENT_FAILURE
2021-04-27T23:15:53.134Z | subchannel | 172.217.219.95:443 IDLE -> TRANSIENT_FAILURE
2021-04-27T23:15:53.134Z | subchannel | 74.125.70.95:443 TRANSIENT_FAILURE -> IDLE
2021-04-27T23:15:53.134Z | subchannel | 172.217.219.95:443 TRANSIENT_FAILURE -> IDLE

A couple of seconds later it does another round of CONNECTING and IPv6 failures.

This pattern continues for several minutes, until finally it gets one more spasm of TRANSIENT_FAILURE -> IDLE, and finally seems to get a working connection and start new StreamingPull calls, at which point the messages flow again and the backlog drains:

2021-04-27T23:20:32.500Z | channel | dns:pubsub.googleapis.com:443 createCall [360] method="/google.pubsub.v1.Subscriber/StreamingPull", deadline=1619566532500
2021-04-27T23:20:32.501Z | channel | Pick result: COMPLETE subchannel: 74.125.70.95:443 status: undefined undefined
2021-04-27T23:20:32.502Z | channel | dns:pubsub.googleapis.com:443 createCall [361] method="/google.pubsub.v1.Subscriber/StreamingPull", deadline=1619566532500
2021-04-27T23:20:32.502Z | channel | Pick result: COMPLETE subchannel: 74.125.70.95:443 status: undefined undefined
2021-04-27T23:20:32.503Z | channel | dns:pubsub.googleapis.com:443 createCall [362] method="/google.pubsub.v1.Subscriber/StreamingPull", deadline=1619566532500
2021-04-27T23:20:32.503Z | channel | Pick result: COMPLETE subchannel: 74.125.70.95:443 status: undefined undefined
2021-04-27T23:20:32.503Z | channel | dns:pubsub.googleapis.com:443 createCall [363] method="/google.pubsub.v1.Subscriber/StreamingPull", deadline=1619566532500
2021-04-27T23:20:32.503Z | channel | Pick result: COMPLETE subchannel: 74.125.70.95:443 status: undefined undefined
2021-04-27T23:20:32.504Z | channel | dns:pubsub.googleapis.com:443 createCall [364] method="/google.pubsub.v1.Subscriber/StreamingPull", deadline=1619566532500
2021-04-27T23:20:32.504Z | channel | Pick result: COMPLETE subchannel: 74.125.70.95:443 status: undefined undefined

Edit: I have multiple services showing this pattern captured in logs now

mgabeler-lee-6rs commented 3 years ago

The issues described in https://github.com/grpc/grpc-node/pull/1322 sound somewhat similar to this at first blush?

murgatroid99 commented 3 years ago

First, you said "I do see in the traces what I'm pretty sure is a bunch of StreamingPull requests ending with Deadline Exceeded". Are you seeing the actual DEADLINE_EXCEEDED code in the gRPC status? The client should generally not set a deadline on streaming requests, so assuming that is the case, that code is probably coming from the server's application logic. I think this is worth looking into on the PubSub side.

I should have mentioned earlier that while grpc-js uses the same environment variables for tracing as the other library, it has different tracer names because it has a different internal architecture. In particular, in this case, http_keepalive and flowctl don't correspond to any trace logs. Keepalive logs are handled by the keepalive tracer, and flow control is handled by Node internals, so it doesn't get traced at all.

I don't think those logs show anything that indicates why this issue would be happening. Overall, this appears to show that grpc-js has a connection, and then the connection is dropped by the server for some reason, and then it establishes a connection again, and then it is dropped again. This is normal behavior if the server has a limit on the time a connection can be active. One important thing here is that a GOAWAY with code 0 as shown in the logs indicates that a server will reject new requests on that connection, but continue processing existing requests.

It is possible that the omitted parts in between would show something else, and if you add the pick_first tracer, and maybe also dns_resolver, the logs will probably show a clearer picture of what is happening with the connections.

That linked PR is probably not related. Each state transition has a different meaning, and I don't see any CONNECTING -> IDLE transitions here. Plus, I fixed that issue last year while the library was still in beta.

Detailed log interpretation:

The first section shows a client handling a situation where a server cleanly dropped the connection or shut down.

I think the next few sections are the result of the so-called "happy eyeballs" algorithm: when given a list of addresses, the client tries to connect to them in that order, with a small delay before trying the next one each time (if the previous one hasn't established a connection yet), then it uses whichever connection gets established first. The connection error in the third log isn't "giving up", it's just one of those connection attempts failing, while others are still being tried. And the fourth log, with the TRANSIENT_FAILURE -> IDLE transitions isn't "giving up" either. My guess is that a connection is successfully established some time between the third and fourth log sections, and the fourth log shows the other connection attempts getting discarded because they won't be used. I think the pick_first tracer will show this more clearly.

The "Pick result" logs just show that there are 4 pending requests at that time, and each one is just getting an instruction to keep waiting for a connection to be established.

After that, the next log shows another clean shutdown or dropped connection, and the last log shows requests starting to be handled because a connection has been successfully established.

mgabeler-lee-6rs commented 3 years ago

Thanks for the feedback & analysis. I figured given the old issue was fixed long ago, this wasn't the same thing again. It just sounded like similar symptoms.

We'll run more tests tomorrow with the recommended updates to the tracer list and collect fresh logs.

mgabeler-lee-6rs commented 3 years ago

First, you said "I do see in the traces what I'm pretty sure is a bunch of StreamingPull requests ending with Deadline Exceeded". Are you seeing the actual DEADLINE_EXCEEDED code in the gRPC status? The client should generally not set a deadline on streaming requests, so assuming that is the case, that code is probably coming from the server's application logic. I think this is worth looking into on the PubSub side.

Sorry, I missed this question. What I'm seeing are logs like this, and unless I'm mistaken, code 4 is DEADLINE_EXCEEDED?

I0427 22:57:44.911352266       7 subchannel.cc:182]          OP[authority:0x4888500]:  CANCEL:{"created":"@1619564264.911314865","description":"Deadline Exceeded","file":"../deps/grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}

I'm only seeing that from the services using the "native" grpc module however, as implied by the format of that log.
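
For reference, the numeric codes in those traces line up with the status enum exported by @grpc/grpc-js; a quick way to double-check them from Node:

const {status} = require('@grpc/grpc-js');

console.log(status.DEADLINE_EXCEEDED);  // 4, matching grpc_status in the log above
console.log(status.RESOURCE_EXHAUSTED); // 8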

murgatroid99 commented 3 years ago

You're right, and the description of "Deadline Exceeded" confirms that. However, the source is deadline_filter.cc, and I'm pretty sure that implies that that is a client-side deadline. I don't know why that would only be happening with the native library.

mgabeler-lee-6rs commented 3 years ago

I've got a "clean" repro for a service with the new trace data. Unlike the past logs, this service has just the one subscriber, so when it stops receiving there is basically no other logging whatsoever, and ~it didn't auto-recover in this case~ it took about 90 minutes to self-recover.

Here are our metrics for the unack'd messages for the relevant subscription. 12:08 in this graph is 16:08 UTC:

[Screenshot: Screenshot from 2021-04-28 13-46-13]

And here are the GRPC trace logs: q-b-w-s-9cf5b99b5-hctsm.1.log.gz

I've trimmed our private app logging, but put in // annotations describing roughly what the trimmed lines were. I also put in a similar annotation showing when the first un-received message was published from the other service. I also trimmed the early part of these logs from when the service was operating correctly, as there's a lot more of our private app logging to clean up there. But I have all of it from the start of the service saved and can pull relevant sections on request.

I notice it seems to start a bunch of StreamingPull requests at the end ... and never gets any messages from them :(

Edited to add: while preparing this summary, the app self-resolved. Here's the logging from the end of the prior log until it finally receives a message again. The elapsed time from the publish of the first delayed message to the subscriber receive recovering is just shy of 90 minutes. q-b-w-s-9cf5b99b5-hctsm.2.log.gz

murgatroid99 commented 3 years ago

I think those expanded logs mostly confirm what I said in my previous comment. Unfortunately, I think that means that the explanation for this issue will not be found there. I think the next best step would be to add the call_stream tracer. You may even want to switch to GRPC_TRACE=all; it might not be too verbose, and it would avoid going back and forth changing the tracer options.

mgabeler-lee-6rs commented 3 years ago

You may even want to switch to GRPC_TRACE=all; it might not be too verbose, and it would avoid going back and forth changing the tracer options

Yeah, seems like it, I guess. I was hoping to avoid that, as I expect it will make it much harder to trim the logs so they're safe to post here, but I'll go through that at this stage.

mgabeler-lee-6rs commented 3 years ago

Monitoring the GRPC_TRACE=all output, I happened to notice this fly by:

2021-04-28T20:22:44.751Z | call_stream | [157] Received server trailers:
                grpc-status: 8
                grpc-message: You have exceeded your StreamingPull connection quota (project="...elided..."). Refer to https://cloud.google.com/pubsub/quotas for more information.

And checking how many subs we've got in the project, and quota usage in GCP, I think this is the major problem we're hitting here.

Not being able to see this as an error or even a log at the application level is, IMO, frustrating at best. I think having subscriber.on('error') handlers get notified about this kind of issue (and any similar quota issues or other RESOURCE_EXHAUSTED type stuff) would be a good thing to add.
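
For context, a minimal sketch of the kind of handler being asked for here; note that it only surfaces errors the library actually emits, so statuses that are retried transparently (like the quota error above) may still never reach it. The subscription name is hypothetical:

const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const subscription = pubsub.subscription('my-subscription'); // hypothetical name

subscription.on('message', message => {
  message.ack();
});

// The request above: errors like RESOURCE_EXHAUSTED should show up here
// instead of being swallowed by transparent retries.
subscription.on('error', err => {
  console.error('subscriber error:', err.code, err.message);
});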

Apologies to the OP for unintentionally hijacking your issue tho.

murgatroid99 commented 3 years ago

Those errors really should be visible to the user. @feywind is PubSub transparently retrying RESOURCE_EXHAUSTED errors? If the server is sending errors like that the user needs to take action on, those errors really need to be exposed to the user.

feywind commented 3 years ago

@murgatroid99 It seems to be, yeah:

        "deadline_exceeded_resource_exhausted_aborted_internal_unavailable": [
          "DEADLINE_EXCEEDED",
          "RESOURCE_EXHAUSTED",
          "ABORTED",
          "INTERNAL",
          "UNAVAILABLE"
        ]

That's configured in the protos. Would the disconnect behaviour we're seeing be explained by firewall issues between the client and server? My current guess is that Docker is doing something weird that causes the connections to go away after 15 minutes.

Apologies to the OP for unintentionally hijacking your issue tho.

Oh, it was me, and no worries. :) I think this is pretty strongly related.

murgatroid99 commented 3 years ago

I don't think the logs from mgabeler-lee-6rs have anything to do with the original issue. They were just running out of quota.

So, I don't think we have any additional information about the original issue, and I don't think there's any particular reason to think it's related to firewalls.

feywind commented 3 years ago

Ah... quota logs aside, is there a reason not to believe that it has to do with the Docker networking? In my repro attempts here, it's literally 100% reproducible with Docker involved, and 100% not reproducible without. With a very small sample size, anyhow. πŸ€” I can try it again with newer versions and send all that over if it helps.

mgabeler-lee-6rs commented 3 years ago

:thinking: I think I should spin up a fresh issue from the quota finding here? It's clearly unrelated to the original. Then all the "oops not this ticket" comments can be collapsed away to clean this ticket up?

Edit: done: https://github.com/googleapis/nodejs-pubsub/issues/1275 and I've collapsed all my own comments above. Naturally I can't collapse away anyone else's though :)

Edit 2: Confirmation that my issue was just the quota one: We got our quota under control, and the problem went away.

mrothroc commented 3 years ago

Ah... quota logs aside, is there a reason not to believe that it has to do with the Docker networking?

This is becoming a critical, customer-disrupting issue for us. To eliminate Docker as a possibility, I am accelerating the migration of our node pools on GKE to the containerd image. Any other suggestions are welcome.

mrothroc commented 3 years ago

I can confirm that this happens with the containerd image as well.

mrothroc commented 3 years ago
[Screenshot: Screen Shot 2021-05-20 at 8 38 56 PM]

This is what it looks like in practice. The pods are idle, so there is no reason for them to delay consuming the messages. You can see the long pauses here before they restart message consumption.

The spike on the left is normal behavior.

feywind commented 3 years ago

@mrothroc Thanks for the update. I'm not as familiar with the ins and outs of Docker, but it sounds like (after some reading) containerd actually doesn't deal with networking. Something else would be doing that (e.g. Docker using containerd as a back-end). In that case, I'd expect to see the same behaviour in both cases.

I'm out of "office" next week, so we're trying to find someone to continue looking at this.

mrothroc commented 3 years ago

Ok, thanks. I can confirm that the issue persists even after the switch to containerd. FYI, I have opened a P2 ticket through our paid support on this issue. Let me know if you want the case number.

mrothroc commented 3 years ago

Anecdotally, it may have actually gotten substantially worse with containerd. Not sure, just suddenly getting lots of complaints.

mrothroc commented 3 years ago

The support engineer for my P2 ticket has confirmed that he can replicate the issue. He has opened a ticket with the PubSub engineering team, but since this is happening in the client library he thinks they are going to send me back here.

@feywind I know you were having trouble replicating this. Can you connect with the support engineer to share his setup?

crwilcox commented 3 years ago

@mrothroc I believe @feywind may be on holiday. Can you perhaps share the support engineer's name or a case number? You can email me if you would rather not share broadly (@google.com).

mrothroc commented 3 years ago

@crwilcox thanks for jumping in! I just sent you the info in an email.

feywind commented 3 years ago

@mrothroc Is this still breaking for you? I'm not completely sure about the mapping between internal support cases and GitHub IDs. I've got several GKE testers set up that do long term publish and subscribe, as well as a set that does very slow publish and subscribe (1 message per 20 minutes). I haven't been able to reproduce any breakage with the latest version of the library. The 15 min reconnect looks like it's still happening, but I haven't noticed it doing anything problematic to the message flow.
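
For anyone who wants to mimic that slow publish-and-subscribe setup, a rough sketch of the publisher side; the topic name is a placeholder and the interval matches the 1-message-per-20-minutes cadence mentioned above:

const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const topic = pubsub.topic('slow-test-topic'); // placeholder topic name

// Publish one small message every 20 minutes to exercise the
// mostly-idle StreamingPull case.
setInterval(async () => {
  const messageId = await topic.publish(Buffer.from('ping'));
  console.log(`published ${messageId} at ${new Date().toISOString()}`);
}, 20 * 60 * 1000);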

mrothroc commented 3 years ago

Hi @feywind apologies for the silence, but all of my activity has been on the internal support case. We did uncover one issue which, once resolved, made the situation better but did not fix it completely.

For future searchers, here is the key line from the internal case notes: "you’re using the version of Pub/Sub with the fix for stream retries (2.11.0), but the package-lock is holding you on an older version of google-gax (and @grpc/grpc-js)"

We had updated the pubsub client but apparently google-gax was not updated in our package-lock. Updating this improved things.

Since this did not entirely resolve the issue, we are currently pursuing a second idea: a limit on the concurrent streams. From the internal chat: "per the documentation it seems that an instance only can manage up to 100 stream connections at time". We were not aware that the instances had a limit on the number of streams. We're also not sure we use that many streams, but we are going to try splitting our streams across instances to test.

Once I have a final resolution, I will update this case.
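
On the stream-limit point, two knobs that are commonly suggested; a rough sketch, assuming the streamingOptions.maxStreams subscriber option is available in the versions in use, with placeholder names and values:

const {PubSub} = require('@google-cloud/pubsub');

// Option A (assumption): open fewer StreamingPull streams per subscriber,
// so many subscribers sharing one channel stay under the concurrent-stream cap.
const pubsub = new PubSub();
const sub = pubsub.subscription('my-subscription', { // placeholder name
  streamingOptions: {maxStreams: 2},
});

// Option B: give a heavy subscriber its own PubSub client instance, which
// should get its own underlying channel instead of sharing one connection
// with every other subscriber in the process.
const dedicated = new PubSub();
const busySub = dedicated.subscription('busy-subscription'); // placeholder name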

feywind commented 3 years ago

Ah ha, okay, that was your case. :) I was somewhat surprised about that limit, too. Apparently this may also be of relevance, though I haven't looked into it too much yet: https://www.npmjs.com/package/grpc-gcp

jeffijoe commented 3 years ago

@mrothroc what version of @grpc/grpc-js and google-gax is needed for the partial resolution?

mrothroc commented 3 years ago

I just took a quick look, but I'm not certain which branch the engineer is doing his tests in. I believe it is 2.15.0, but I may be looking in the wrong branch. I will check with the engineer to see.

feywind commented 3 years ago

This is where I'm at locally, if that helps:

@google-cloud/pubsub@2.15.1
└─┬ google-gax@2.17.0
  └── @grpc/grpc-js@1.3.4