googleapis / nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
https://cloud.google.com/pubsub/
Apache License 2.0

Streaming pull subscriber does not notify of loss of network #1764

Open

surlyhacker commented 1 year ago

Steps to reproduce

  1. Call PubSub.subscription(), then set up on() handlers for 'message', 'debug', 'error', and 'close'.
  2. After the initial subscription is established and some streaming pull messages have been received, disconnect the network on the machine.
  3. Notice that no error or debug events of any kind arrive for several minutes. In other words, there is no indication that the network is gone and the subscription pull is therefore no longer operating. If the network is disconnected before the streaming pull begins, there are debug messages and eventually an error after the default 5 minute timeout, at which point the subscription is closed. But none of this happens if the network disconnects after the stream has been established.
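For reference, a minimal sketch of the handler setup from step 1. The subscription name is hypothetical; the event names ('message', 'debug', 'error', 'close') are the ones the steps refer to.

```javascript
// Repro sketch: attach the four handlers discussed above to a
// Subscription-like object (anything with an EventEmitter interface).
function attachHandlers(subscription) {
  subscription.on('message', (message) => {
    console.log('received', message.id);
    message.ack();
  });
  subscription.on('debug', (msg) => console.log('debug:', msg));
  subscription.on('error', (err) => console.error('error:', err));
  subscription.on('close', () => console.log('subscription closed'));
  return subscription;
}

// In a real project (package installed, credentials configured):
//   const {PubSub} = require('@google-cloud/pubsub');
//   const pubsub = new PubSub();
//   attachHandlers(pubsub.subscription('my-subscription'));
```

After the network drop in step 2, none of these handlers fire for a long time, which is the problem being reported.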
surlyhacker commented 1 year ago

Actually more extensive testing shows that after 15 minutes, there are 'debug' events generated, but after waiting more than 1 hour, there is no 'error' or 'close' generated, so there is still no clear indication of an issue. Is there a proper way to detect this scenario that I am not using?

Edit: I think the 15 minutes is just the default behavior of gRPC and/or pubsub for restarting HTTP/2 streams, so it is not directly related to the loss of network.

feywind commented 1 year ago

Sorry, fumbly fingers on the checkboxes :D

surlyhacker commented 1 year ago

Here is what I've found so far: pubsub appears to have an internal keepalive mechanism that writes an empty object to the gRPC stream wrapper every 30 seconds, but it seems this write will never fail, due to the behavior below.

pubsub uses gRPC, which uses Node's HTTP2 API.
Even that HTTP2 API does not seem to detect loss of network when trying to write() to the Http2Stream. Http2Stream uses Node's stream.Duplex class internally, which (via the stream.Writable docs) is documented to buffer writes. So a call to write() will buffer and provide no timely or clear indication that the underlying TCP socket is not receiving ACKs and is thus disconnected. (There is some more complexity around how it buffers in the docs; I've found that on Windows it does behave as documented, but not on Linux.)

Update: on a modern Linux default configuration, the TCP retransmission timeout gives up after roughly 16 minutes of retries. Only then will write() indicate an issue, unless TCP_USER_TIMEOUT is set on the socket. See here for some details.

So when gRPC ultimately calls write(), it also doesn't know about the disconnection. However, I see that gRPC has its own keepalive mechanism, which seems to use HTTP2 pings under the covers, and those work in this situation. In other words, when HTTP2 pings fail to receive responses, this is easy to detect and react to, which gRPC does. But this is not particularly well exposed to pubsub on top of it.

If I pass {"grpc.keepalive_time_ms": 60000} in the PubSub() constructor options (which is itself seemingly undocumented as a supported parameter), it does activate this system within gRPC and it will send HTTP2 pings and detect the loss of connection within that time as configured.
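A sketch of that option. The keys are standard gRPC channel arguments; passing them through the PubSub constructor is, as noted, not clearly documented, so treat the pass-through as an assumption confirmed only by experiment.

```javascript
// gRPC channel arguments that activate the HTTP/2 ping keepalive.
const channelOptions = {
  'grpc.keepalive_time_ms': 60000,    // send an HTTP/2 ping every 60 seconds
  'grpc.keepalive_timeout_ms': 20000, // treat the connection as dead if no ping ack within 20s
};

// In a real project:
//   const {PubSub} = require('@google-cloud/pubsub');
//   const pubsub = new PubSub(channelOptions);
```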

However, it seems the most viable (somewhat hack-ish) way to bubble that up to pubsub is code like this:

// mysubby is a pubsub Subscription object
mysubby._subscriber.getClient().then((client) => {
  client.initialize();
  client.subscriberStub.then((sub) => {
    let constate = sub.getChannel().getConnectivityState();
    const cb = () => {
      const newstate = sub.getChannel().getConnectivityState();
      console.log(new Date() + ' Conn state changed from ' + constate + ' to ' + newstate);
      constate = newstate;
      sub.getChannel().watchConnectivityState(newstate, Infinity, cb);
    };
    sub.getChannel().watchConnectivityState(constate, Infinity, cb);
  });
});

And then you can use the connectivity state values to know when the channel transitions from 2 ("READY") to 3 ("TRANSIENT_FAILURE").
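For reference, those numeric values correspond to @grpc/grpc-js's connectivityState enum; a small helper to spot the transition described above might look like:

```javascript
// Numeric connectivity states as defined by @grpc/grpc-js.
const ConnectivityState = {
  IDLE: 0,
  CONNECTING: 1,
  READY: 2,
  TRANSIENT_FAILURE: 3,
  SHUTDOWN: 4,
};

// True when a previously-healthy channel has just lost connectivity.
function isLikelyDisconnect(prev, next) {
  return prev === ConnectivityState.READY &&
         next === ConnectivityState.TRANSIENT_FAILURE;
}
```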

watchConnectivityState docs

But I feel like using this undocumented internal API is a hack.

I love the fact that pubsub, gRPC, and Node's HTTP2 APIs all try to seamlessly re-establish connectivity without failing, but taken to this extreme, in a situation where (say) the pubsub Publisher publishes infrequently and unpredictably, it becomes very difficult to detect network issues in a timely way. And if timeliness is important, I don't know of a solution other than this hack. Would it be not using Streaming Pull? That seems extreme. I have to use pubsub because it's the only mechanism provided to me (by Google, in this case) to get my data (from Google).

So I really think there should be documented public API support for this.

End note: it took me many hours of frustrating debugging to figure this out, due to documentation gaps in all of these APIs.

surlyhacker commented 1 year ago

Another update: if I override the default "deadline" that restarts streams every 15 minutes to some very long value (e.g. hours), then watchConnectivityState() does trigger upon TCP socket failure (at ~16 minutes on Linux), even with grpc.keepalive_time_ms unset.
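One way that override might be expressed; treating nodejs-pubsub's streamingOptions.timeout (milliseconds) as the knob that controls the stream restart deadline is an assumption here, not something the library documents for this purpose.

```javascript
// Hedged sketch: subscriber options with a very long stream deadline,
// so the stream outlives the ~16 minute Linux TCP ACK timeout.
const subscriberOptions = {
  streamingOptions: {
    timeout: 4 * 60 * 60 * 1000, // hypothetical: 4 hours
  },
};

// In a real project:
//   const sub = pubsub.subscription('my-subscription', subscriberOptions);
```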

I think what is happening is that if the deadline is shorter than the TCP ACK timeout, the HTTP2 streams are destroyed and re-created prior to the TCP timeout, and so no error is ever generated in those streams, even though they are being written to by the pubsub keepalive.

It is mysterious, though, that Node apparently ignores these errors from the underlying socket and continues to try to write to it, never propagating them to the Http2Session level, just because the original Http2Stream being written to is no longer alive. Because in the end there is just a single socket present the whole time.

If the Http2Session was destroyed correctly, then gRPC would have to re-create it, which would certainly fail because at this point no new socket can successfully be opened.

Aprona commented 1 year ago

Any update on this? We are seeing this quite regularly and it is causing a lot of issues, mainly because a high volume of messages combined with the need for manual intervention to re-establish the connection is a bad combination.

Currently using version 4.0.1 in a GKE environment.

nguyen-vo commented 7 months ago

You can observe this behavior while using Cloud Run. Occasionally Cloud Run will reset the underlying TCP socket, which is similar to what @surlyhacker describes: https://cloud.google.com/run/docs/container-contract#outbound-reset.

feywind commented 6 months ago

As several have noticed, these stream timeout issues seem to be at a layer lower than Pub/Sub. We don't have a ton of visibility into it, and I doubt we can do much to mitigate it directly.

What I'd like to do is get the Pub/Sub library detecting when it needs to close and restart connections, which seems like a reasonable fix at this layer. Papering over the problem, but... yeah.

In the meantime, we have actually set the keepalives by default now.

feywind commented 6 months ago

By the by, I don't know if any of this great low level analysis would be interesting to @murgatroid99 - passing it along in case: https://github.com/googleapis/nodejs-pubsub/issues/1764#issuecomment-1605803635

surlyhacker commented 6 months ago

> By the by, I don't know if any of this great low level analysis would be interesting to @murgatroid99 - passing it along in case: #1764 (comment)

Thanks for working on this @feywind. I think that exposing some of the underlying (e.g. gRPC) APIs directly, for users who want to access them, could also help a lot, such as a cleaner way to get to getConnectivityState() and watchConnectivityState(), as I described in an earlier comment.

murgatroid99 commented 6 months ago

I maintain the gRPC library, so I am speaking from that perspective.

I am unsure why you would need to do any additional work to "bubble up" connection failures detected by the keepalive functionality. When the gRPC client sees that a keepalive ping does not get a response within the configured timeout (default 20s), it closes that connection and ends all open streams on that connection with an error. The pubsub library should see that error without you taking any explicit action.

As a general rule, you as a user should essentially never need to use getConnectivityState and watchConnectivityState. There is almost certainly a better way to do what you want, or something is broken.

feywind commented 6 months ago

Ah, okay. Then maybe what's needed here is to figure out why those errors might not be bubbling up. (Probably because we are retrying internally, but maybe it would be useful for some users to receive that as a sign of possible trouble.)

nguyen-vo commented 4 months ago

In our workload, we use Cloud Run to listen to PubSub messages. As mentioned in this comment, Cloud Run occasionally resets the underlying TCP socket, which causes the client to behave as others have mentioned in this issue. Our workaround is to use one client per subscription and set a timer when opening a connection. If a subscriber doesn't receive any message within that period, we unsubscribe and create a new PubSub client.

We opened a support ticket and learned that when we create a new PubSub client, a new TCP connection is established. They recommend reinitializing the gRPC client (the PubSub client) when a network reset is detected. However, Cloud Run does not send any indication of when the container does that, so we decided to implement a timer. Since then we haven't had any connectivity issues.
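The timer workaround above can be sketched roughly like this; the watchdog class and the interval are illustrative, not part of the library.

```javascript
// Illustrative idle watchdog: if no message arrives within maxIdleMs,
// fire onIdle so the caller can tear down and recreate the PubSub client.
class IdleWatchdog {
  constructor(maxIdleMs, onIdle) {
    this.maxIdleMs = maxIdleMs;
    this.onIdle = onIdle;
    this.timer = null;
  }

  // Call on every received message (and once at startup) to push the
  // deadline out; re-petting replaces any pending timer.
  pet() {
    if (this.timer) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.onIdle(), this.maxIdleMs);
  }

  stop() {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
  }
}

// In a real project, roughly:
//   const watchdog = new IdleWatchdog(60000, recreatePubSubClient);
//   subscription.on('message', (m) => { watchdog.pet(); m.ack(); });
//   watchdog.pet();
```

The right interval depends on how infrequently the publisher is expected to be silent; too short and healthy clients get recycled needlessly.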

So I spent some time looking into the message-stream module. Since the client object is persistent throughout the lifetime of the message-stream instance, it will continue to use the same TCP connection even when a stream is ended and reset.

@feywind I believe the issue would be resolved if we had a way to reinitialize the PubSub client when resetting the stream, or when we know the connection is lost.

murgatroid99 commented 4 months ago

@nguyen-vo Do you have keepalives enabled, as mentioned in the other comments? That should address the issue you are describing.

nguyen-vo commented 4 months ago

I assume you mean the gRPC keepalive option. We are still using v3.60 and this option is enabled. Though we still hit the issue sometimes, hence we implemented the reset timer.