landelare / ue5coro

A C++20 coroutine implementation for Unreal Engine 5 that feels almost native.
BSD 3-Clause Clear License
543 stars 48 forks source link

Latent Coroutine Crash On Exiting Game - Repro included #22

Closed nwilson100 closed 5 months ago

nwilson100 commented 6 months ago

This repro case is based upon a crash in a real use case of mine. The use case in the game is that one parent actor manages the lifetime of other actors. When the parent actor gets destroyed, it calls a coroutine on its child actors which they use to destroy themselves latently at their own pace. This works fine during gameplay, but if the parent actor gets destroyed by the game being stopped in editor, the coroutines in the child actors will cause a crash.

This was tested including the recent changes in your commit 0cf2485.

Repro Case:

  1. Make empty level I'll refer to as "CrashLevel".
  2. Ensure the game mode is set to the blank GameModeBase.
  3. Place an actor of type CoroParent in CrashLevel.
  4. Run the game in-editor.
  5. Observe the world outliner to verify CoroChild actors have been created, as well as blank actors.
  6. Stop running the game by clicking the stop button in-editor.
  7. The game crashes at this point.

One unknown detail is whether this crash will also occur when closing a standalone build, but the repro setup for that would look similar to this in-editor setup. I've also seen this crash occurs with non-latent coroutines anecdotally, but this is not included in the repro case.

Attached is an image of the coro frame pointer and callstack at time of the crash. Most of the time the crash occurs outside of the scope of a coro frame pointer, though. CoroFramePointerOnCrash

Source for required classes CoroParent and CoroChild are here: CoroCrashOnGameExitSource.zip

landelare commented 6 months ago

Thank you for the gold-quality repro, as always!

ACoroParent::BeginDestroy is calling coroutines on invalid child actors. By changing the condition from if (childActor) to if (IsValid(childActor)) I've managed to avoid the crash with both pending kill enabled and disabled.

I'm not sure what I'll do with the plugin, but hopefully this will help with your immediate problem.

landelare commented 6 months ago

As an alternative repro, I changed CleanUp to run in async mode and take responsibility for lifetime management. If I do this:

TWeakObjectPtr WeakThis = this;
co_await UE5Coro::Latent::NextTick();
if (!WeakThis.IsValid())
    co_return;

the crash is prevented.

@nwilson100 could you please test master with your real project? I'm expecting latent mode to check and async mode to invoke UB without such a weak-this check. I have added a few minor improvements that aren't directly related to these crashes but should improve world detection.

nwilson100 commented 6 months ago

Hello, sorry for the delay. Before I get into the test you asked for, you're right that checking the childActor with IsValid was a fix, and I can confirm that this also does fix the issue for the real project! Thanks for that.

The following tests using latest master were without the IsValid check applied in user code.

Confirmed: Both the real-world repro case and the minimal repro case halt upon the "Provide a valid world context parameter" check in FLatentPromise::Init(), line 181.

Worth noting in addition: after updating to latest master, the project began crashing on start because it had some nested latent coroutines with the inner-most coroutine being a static function. It looks like previously, the LatentPromise was falling back to using a pointer to GWorld for the static latent coro. Naturally this was now solved by sending a context UObject * as a parameter to that coro. Sharing it as a heads up that the current master changes may manifest in that way for other users, in case it was not expected. I'm not particularly against this as it requires being explicit with your intended world context for these cases, but my original and definitely naive assumption was that the context was somehow able to be deduced by the static coro being invoked from within a coroutine in a UObject that does have a valid world context, even if the static coro wasn't a member of that UObject. That said, I'm not sure how that kind of context deduction could be implemented.

landelare commented 6 months ago

Thank you for the tests! In hindsight, I think that defaulting to GWorld was perhaps too generous for latent coroutines - BP itself expects you to run on a UObject with a valid GetWorld (usually: actors).

I was and still am thinking of possibly associating a TWeakObjectPtr<UWorld> with latent coroutines behind the scenes that it could offer as a fallback to callees without a world context, but I can already see that causing issues if there's a legitimate cross-world call.

landelare commented 5 months ago

@nwilson100 I have published a beta 1.10.1 with the intended bugfixes, with the next major release planned to have more invasive changes.

nwilson100 commented 5 months ago

@landelare This looks good. I mostly use latent coroutines with FForceLatentCoroutine, and am happy at the idea of being able to be more explicit with the world context a coro will be executing within. Most of my bugs involving coroutines have involved issues around that.

Along the lines of requiring latent coroutines to be more explicit with the intended world context, I have an idea that seems helpful for my use case, but I have a fairly limited perspective on all intended use cases, so bear with me. The idea is: Only for cases in which FForceLatentCoroutine is used, what do you think about requiring or allowing the world context to be passed to the coroutine via the FForceLatentCoroutine constructor, rather than as a separate parameter? Here are some advantages and disadvantages that come to mind for it:

Advantages:

  1. Allows the calling-context code to be more self-documenting and clear, e.g:
    
    Example original method signature: PerformLatentRoutine(UObject * context, UObject * source, UObject * target, FForceLatentCoroutine latent)
    //Usage at call-site
    PerformLatentRoutine(this, this, targetObj, {});

New method signature: PerformLatentRoutine(FForceLatentCoroutine latent, UObject source, UObject target) //Usage at call-site PerformLatentRoutine(FForceLatentCoroutine{this}, this, targetObj);


2. If the only constructor for FForceLatentCoroutine now requires a context object, this causes the convention previously checked & required at compile time to now be even more concrete. This enforces the context object convention in a way that's visible to IDE suggestions, such that it's transparently presented to the user as a requirement while writing such code:

Without the world context constructor required, you could write this with no error hints appearing: Declaration: LatentRoutine(FForceLatentCoroutine latent); Usage: LatentRoutine({}); Outcome: No error apparent at time of writing, but presumably would error at compile time as it will now explicitly require a first-param world context.

If FForceLatentCoroutine requires a context object as its parameter: Declaration: LatentRoutine(FForceLatentCoroutine latent); Attempted Usage: LatentRoutine({}); //Error hint provided by IDE: No default/empty constructor for FForceLatentCoroutine User can intuitively correct it to: LatentRoutine({this}); or LatentRoutine(FForceLatentCoroutine{this});



3. Reduces verbosity of method signatures by essentially combining the context param into the FForceLatentCoroutine, and avoids the empty latent constructors, i.e. `LatentRoutine(this, {});`

4. Coroutines using FLatentActionInfo can continue using the separate world context object param to match the BP strategy. So we now have two ways of using latent coroutines with explicit contexts.

5. User code specifies intent even more clearly if we rename FForceLatentCoroutine to something like FLatentCoroContext.
`FLatentCoroContext{this}` seems very appealing.

Disadvantages:
1. Deviates from the strategy used by FLatentActionInfo in BPs for latent actions in general. While allowing multiple workflows or angles for user code to specify intent can be a good thing, this can also feel like added complexity.

2. Would definitely break user code that already was using FForceLatentCoroutine with a first world context parameter, which otherwise may not have been broken with this coming update, if we do make the constructor mandatory rather than optional. They would need to make the context an argument to their FForceLatentCoroutine constructors, and adjust their coroutine's signatures to remove the context parameter.

There's probably other issues with this, but thought I would share this as food for thought!
landelare commented 5 months ago

@nwilson100 This is overall a good idea, but for now I think I'll go with "first parameter is the world context object". The next major update is currently planned to enforce this with no other parameters being looked at - other than reducing surprises, it should result in a little performance bump, too. My reasoning:

FForceLatentCoroutine is BP compatible and hidden on all call sites. This change would break every single affected coroutine call node and require BP to pack the world context into the struct instead of, e.g., being able to rely on meta=(WorldContext). This would need an extremely strong justification.

Latent coroutines are trying to match BP as much as C++ allows, and Latent::Chain() also relies on this convention heavily. This additional limitation will silently assure cross-compatibility in all cases.

I'm still looking into what to do with FLatentActionInfo. CallbackTarget is an obvious world context which is currently used with maximum priority, and I think it might be still preferable to this. I have no examples currently where this would be a significant difference though, even in contrived multi-world scenarios.

Regarding your usage examples, I would suggest declaring coroutines as TCoroutine<> Foo(FForceLatentCoroutine = {}) so you can call it as just Foo() at every single call site and mostly ignore that parameter's existence.

landelare commented 5 months ago

I consider this fixed in v1.10.1 (with more improvements to come in the next major release), if you're still having problems that aren't fixed by IsValid checks on your side and are NOT caught by checks/ensures, please feel free to reopen.