grpc / grpc-dart

The Dart language implementation of gRPC.
https://pub.dev/packages/grpc
Apache License 2.0
852 stars 267 forks source link

KeepAlive Ping Error Handling #738

Open lyokato opened 4 days ago

lyokato commented 4 days ago

The following is an excerpt of the code that sets up KeepAlive written in lib/src/client/http2_connection.dart

      if (options.keepAlive.shouldSendPings) {
        keepAliveManager = ClientKeepAlive(
          options: options.keepAlive,
          ping: () {
            if (transport.isOpen) {
              transport.ping();
            }
          },
          onPingTimeout: () => transport.finish(),
        );
        transport.onFrameReceived
            .listen((_) => keepAliveManager?.onFrameReceived());
      }

We found that the above code occasionally threw an exception when calling the transport.ping(). Sometimes an HTTP/2 Connection error occurred, starting from the http2 library side.

Therefore, we have applied the following patch to prevent the exception from flying.

              transport.ping().catchError(() {
                shutdown();
              });

I am not sure if this is the appropriate way to fix the problem, but I will report it to you.

mosuem commented 4 days ago

Thanks for reporting! I will take a look. Do you have any more information on the environment where this occurs? I haven't seen this happening on my machine.

lyokato commented 3 days ago

I was using the following


When we had users using the app in this situation, errors were reported to Firebase Crashlytics and Bugsnag.

HTTP/2 error: Connection error: Connection is being forcefully terminated. (errorCode: 10)

This is what was flowing to PlatformDispatcher.onError in the Flutter application. This means that an uncaught error is occurring somewhere.

The place where the request is sent is enclosed in try/catch as described above. This is why I thought the KeepAlive ping was suspicious.


Then we looked at where this error originated in the first place.

https://github.com/dart-lang/http2/blob/master/lib/src/connection.dart

There are several places in this code where the following calls are made

 _terminate(ErrorCode.CONNECT_ERROR, causedByTransportError: true);

The following is implemented in the _terminate method called here.

      var exception = TransportConnectionException(
          errorCode, 'Connection is being forcefully terminated.');

      // Close all streams & stream queues
      _streams.terminate(exception);

      // Close the connection queues
      _incomingQueue.terminate(exception);
      _outgoingQueue.terminate(exception);

      _pingHandler.terminate(exception);
      _settingsHandler.terminate(exception);

From the error message, we could see that it was this very exception that occurred this time.

We found that this exception was being passed to streams, incomingQueue, outgoingQueue, pingHandler, and settingsHandler.

Since we thought ping was suspicious, we checked the code of this _pingHandler and tracked how it was being used from the grpc-dart side.

As a result, we found that no catchError was made at the point of the first patch.

After applying the patch, I have had users use the app, and the error is no longer being sent to PlatformDispatcher.onError.


We do not know how to reproduce this reliably.

We are using grpc via an AWS load balancer. It is possible that the problem occurs when we hang a connection for a long time via the load balancer, but we have not yet investigated that in detail. We plan to investigate further.

mosuem commented 3 days ago

Thanks for the write-up, this is helpful.