grpc / grpc-dart

The Dart language implementation of gRPC.
https://pub.dev/packages/grpc
Apache License 2.0
835 stars 256 forks source link

Fix hang that occurs when streaming calls are ongoing and a hot-restart occurs in Flutter web #718

Open galenwarren opened 6 days ago

galenwarren commented 6 days ago

When hot-restarting in Flutter Web, streaming GRPC calls can get stuck in this method of _GrpcWebConversionSink:

  @override
  void add(ByteBuffer chunk) {
    _chunkOffset = 0;
    final chunkData = chunk.asUint8List();
    while (_chunkOffset < chunk.lengthInBytes) {
      switch (_state) {
        case _GrpcWebParseState.init:
          _frameType = _parseFrameType(chunkData);
          break;
        case _GrpcWebParseState.length:
          _parseLength(chunkData);
          break;
        case _GrpcWebParseState.message:
          _parseMessage(chunkData);
          break;
      }
    }
    _chunkOffset = 0;
  }

The while loop runs, but the switch statement doesn't match any cases so it just keeps looping. This seems to have something to with some issues around how Dart handles enums across isolates (some discussion here and elsewhere) and the fact that hot-restarting involves the creation of a new isolate.

In any case, adding a default handler causes the loop to break when a call is orphaned in this manner, which lets the isolate shut down and the hot-restart succeed.

There is also some discussion in the Dart/Flutter community about creating hooks for all platforms that would allow one to reliably detect when an isolate were shutting down and take action. If that were in place, that would allow for a cleaner solution, but it doesn't seem to be available yet.

linux-foundation-easycla[bot] commented 6 days ago

CLA Signed

The committers listed above are authorized under a signed CLA.

mosuem commented 5 days ago

Thanks for the PR!

The enum switch not being exhaustive is really worrying behavior in my opinion. I also don't see where the _GrpcWebParseState would be sent across isolates?

mosuem commented 5 days ago

@galenwarren, could you possibly provide a reproducible example or add a test? Just wanting to make sure the exhaustiveness is the issue.

mraleph commented 5 days ago

@nshahan @biggs0125 can you shed some light on how hot restart is wired up in Flutter Web through DDC (or dart2js if supported)? It seems extremely problematic if hot restart keeps old code running but changes identity of constants which breaks things like switches.

@galenwarren This seems like a serious bug in the hot restart mechanism. Trying to work around it in individual packages is like trying to carry water in a sieve.

galenwarren commented 5 days ago

@mosuem @mraleph This behavior only happens during hot restarts, so I'm not sure how to add a test. But this is absolutely what's happening -- I can catch it in the debugger when I do a hot restart with an active streaming call and observe it not hitting any of the case statements, just looping forever and blocking the hot restart, forcing a (time-consuming!) stop and restart. With a default case added, it exits and there is no hang.

I've encountered this problem with hanging during hot restarts for months -- several times per day -- but couldn't figure out what was going on. Since I've made this change locally, I haven't had a single issue! Already saved me a ton of time. :)

I agree that this is not the root problem. The root problem seems to be that there is no reliable way to detect a pending isolate shutdown in order to run cleanup code. If that existed, we could hook that and abort the in-progress calls. Some discussion of this here for the FFI case. More discussion here. There was even a PR to add a hook (which I can't find now, but will look for more if you want) that was postponed.

However, it's a tremendous time-saver, I can't imagine I've been the only person affected by this. Outside of the multiple-isolate case that occurs during hot restart, I don't think the default case would ever be hit so it should be harmless.

EDIT: Also, I'm working exclusively in Flutter web, so maybe this issue and/or the weirdness with enum values across isolates only occurs there, I can't say for sure.

EDIT #2: I could probably figure out how to repro it and do a screen capture of the debugger both with and without the default case added, if that would be helpful to establish that this is what is happening? Just let me know.

mraleph commented 5 days ago

@galenwarren note that there are no isolates on the Web, so discussions around isolates shutdown on hot-restart don't apply here.

galenwarren commented 5 days ago

@mraleph Yeah, I'm sure you're more familiar with Dart internals than I am. I was speculating it was related to isolates because of the similar discussions for FFI.

Nonetheless, at a high level, this is what is happening on Flutter web. There must be some other mechanism that is causing the enum values not to be recognized in the switch statement, causing an infinite loop. Would you like me to do a screen capture so you can see the debugger behavior?

nshahan commented 4 days ago

can you shed some light on how hot restart is wired up in Flutter Web through DDC (or dart2js if supported)? It seems extremely problematic if hot restart keeps old code running but changes identity of constants which breaks things like switches.

Hot restart is only supported in DDC. The model we aim to support is to: 1) Recompile the necessary libraries. 2) Clear all state in the application. 3) Reload the recompiled libraries in the browser. 4) Rerun the main() method.

In practice this means that values from one run of the app should not exist in later restarts of the application and it should be considered a bug if they do. It sounds like this might be an interaction between old code from a callback that is still running and the global constants table that contains the new versions of the enum values. Historically callbacks have been a source of problems like this and we have added some cleanup for callbacks from previous versions of the application (dom listeners, timers, etc).

@galenwarren With your patch that bails out of the loop, would it be possible to get a full stack trace at the moment you have an enum value that doesn't match in the switch? From there we might be able to recognize a convenient place we could add some additional validation to help cancel old callbacks of this kind.

The root problem seems to be that there is no reliable way to detect a pending isolate shutdown in order to run cleanup code. If that existed, we could hook that and abort the in-progress calls.

@ditman I know we have a mechanism for applications to specify cleanup code to run on hot restart in google3. Was anything ever added to flutter web to provide the same affordance?

galenwarren commented 4 days ago

@nshahan Regarding:

It sounds like this might be an interaction between old code from a callback that is still running and the global constants table that contains the new versions of the enum values.

In this app, there are long-running grpc server-streaming calls, i.e. one request and then a stream of responses over long-ish periods of time. These calls would be active during hot restarts and could be the source of the callbacks.

Here's the stacktrace, it's taken from a breakpoint at the break; in the default: case below, which is line 157 of web_streams.dart:

  @override
  void add(ByteBuffer chunk) {
    _chunkOffset = 0;
    final chunkData = chunk.asUint8List();
    // in flutter web, when a hot-restart is requested, the code can get stuck
    // in the following loop if there is an active streaming call. the
    // switch statement is invoked but doesn't match any of the cases. this
    // presumably has something to do how enums work across isolates.
    // possibly related to https://github.com/dart-lang/sdk/issues/35626.
    //
    // in any case, adding a default handler to the switch statement
    // causes this loop to end in that case, allowing the old isolate
    // to shut down and the hot-restart to work properly.
    processingLoop:
    while (_chunkOffset < chunk.lengthInBytes) {
      switch (_state) {
        case _GrpcWebParseState.init:
          _frameType = _parseFrameType(chunkData);
          break;
        case _GrpcWebParseState.length:
          _parseLength(chunkData);
          break;
        case _GrpcWebParseState.message:
          _parseMessage(chunkData);
          break;
        default:
          break;
      }
    }
    _chunkOffset = 0;
  }```

_StackTrace (http://localhost:8080/dart-sdk/lib/_internal/js_dev_runtime/patch/core_patch.dart 843:28 get current http://localhost:8080/packages/grpc/src/client/transport/web_streams.dart 157:11 eval http://localhost:8080/packages/grpc/src/client/transport/web_streams.dart 157:11 eval http://localhost:8080/packages/grpc/src/client/transport/web_streams.dart 157:11 add http://localhost:8080/dart-sdk/lib/convert/chunked_conversion.dart 69:5 add http://localhost:8080/dart-sdk/lib/async/stream_transformers.dart 111:7 [_handleData] http://localhost:8080/dart-sdk/lib/async/zone.dart 1407:47 _rootRunUnary http://localhost:8080/dart-sdk/lib/async/zone.dart 1308:19 runUnary http://localhost:8080/dart-sdk/lib/async/zone.dart 1217:7 runUnaryGuarded http://localhost:8080/dart-sdk/lib/async/stream_impl.dart 365:5 [_sendData] http://localhost:8080/dart-sdk/lib/async/stream_impl.dart 541:13 perform http://localhost:8080/dart-sdk/lib/async/stream_impl.dart 646:10 handleNext http://localhost:8080/dart-sdk/lib/async/stream_impl.dart 617:7 http://localhost:8080/dart-sdk/lib/async/zone.dart 1391:47 _rootRun http://localhost:8080/dart-sdk/lib/async/zone.dart 1301:19 run http://localhost:8080/dart-sdk/lib/async/zone.dart 1209:7 runGuarded http://localhost:8080/dart-sdk/lib/async/zone.dart 1249:23 http://localhost:8080/dart-sdk/lib/async/zone.dart 1399:13 _rootRun http://localhost:8080/dart-sdk/lib/async/zone.dart 1301:19 run http://localhost:8080/dart-sdk/lib/async/zone.dart 1209:7 runGuarded http://localhost:8080/dart-sdk/lib/async/zone.dart 1249:23 callback http://localhost:8080/dart-sdk/lib/async/schedule_microtask.dart 40:11 _microtaskLoop http://localhost:8080/dart-sdk/lib/async/schedule_microtask.dart 49:5 _startMicrotaskLoop http://localhost:8080/dart-sdk/lib/_internal/js_dev_runtime/patch/async_patch.dart 181:7 )

nshahan commented 4 days ago

cc @sigmundch This is the issue I was discussing briefly earlier today.

ditman commented 4 days ago

@ditman I know we have a mechanism for applications to specify cleanup code to run on hot restart in google3. Was anything ever added to flutter web to provide the same affordance?

@nshahan This is how open-source Flutter web handles restart signals:

These are some of the different points where the engine adds listeners:

AFAIK this is not exposed to end users (flutter apps/plugins); if an end-user attempts to hook to that ext.flutter.disassemble "channel", it fails because it's a single-subscription stream (or similar).

sigmundch commented 4 days ago

These are some of the different points where the engine adds listeners

I wonder if any of those points is observable from the app, and could be used to trigger cleanup in the app and stop any open grpc calls? For example, I noticed one of the registerHotRestartListener indirectly disposes the browserHistory. Is it possible to detect that the browserHistory was disposed and use that as the trigger?

ditman commented 4 days ago

I wonder if any of those points is observable from the app ... Is it possible to detect that the browserHistory was disposed and use that as the trigger?

@sigmundch If we want to expose this to apps, I think we should design the way to do it "properly" (I imagine a BeforeHotRestartStream or something similar?) at the framework level (dart:ui); and finally solve this veeeery old issue:

If we want this to be a web-only API, it could be tucked in the dart:ui_web package, which is where we put "the web engine API facades that we need from app/plugin code", we shouldn't observe on a side-effect (that can be triggered by other things) IMO.

sigmundch commented 4 days ago

Completely agree :smile: - I just suddenly got curious if it was possible as a way to test things out

ditman commented 4 days ago

got curious if it was possible as a way to test things out

Ahh, gotcha, I thought we needed this as a permanent feature (got excited for a second :P)

You can potentially listen to the browser History with the popstate event :P https://developer.mozilla.org/en-US/docs/Web/API/PopStateEvent (if there's any routing/something that changes the URL, I guess!)

nshahan commented 2 days ago

cc @biggs0125 Your changes to the async logic in DDC might present an opportunity to identify and cancel more callbacks from the previous generation of the program when they attempt to start running again.