landelare / ue5coro

A C++20 coroutine implementation for Unreal Engine 5 that feels almost native.
BSD 3-Clause Clear License
543 stars 48 forks source link

UObject destruction causes it to leak its incomplete(suspended) latent coroutines #27

Open yuriy0 opened 1 month ago

yuriy0 commented 1 month ago

If you create a latent coroutine bound to some object, then that couroutine suspends, then the object is destroyed, the coroutine leaks.

For clarity consider this example:

UCLASS()
class UTestCoroObj : public UObject
{
public:
    GENERATED_BODY()

    UE5Coro::TCoroutine<void> Test(FForceLatentCoroutine _ = {})
    {
        UE_LOG(LogTemp, Warning, TEXT("Entering coroutine UTestCoroObj"));
        ON_SCOPE_EXIT
        {
            UE_LOG(LogTemp, Warning, TEXT("Destroying coroutine stack UTestCoroObj"));
        };

        co_await UE5Coro::Async::PlatformSeconds(100000.0f);
    }

    UFUNCTION(BlueprintCallable)
    static void Create(UWorld* World)
    {
        NewObject<UTestCoroObj>(World)->Test();
    }
};

Calling the Create method, then forcing a garbage collection, results in the coroutine being leaked. If the awaited expression never resumes, then the ON_SCOPE_EXIT never runs and the suspended awaiter is never destroyed. Here it is PlatformSeconds (which technically will resume in the far future) but it could be any awaiter, in particular one which can never resume once the object is destroyed.

This is what happens:

  1. Test reaches a suspend point, calling TAwaiter::await_suspend
  2. That marks the promise as "detached from game thread" via DetachFromGameThread (https://github.com/landelare/ue5coro/blob/master/Plugins/UE5Coro/Source/UE5Coro/Public/UE5Coro/AsyncCoroutine.h#L93-L100)
  3. The object is gc'ed some time later, so the latent action manager calls ~FPendingLatentCoroutine which cancels the coroutine and resumes it.
  4. FLatentPromise::Resume will explicitly do nothing, not destroying or resuming the coro handle, because the promise is marked "detached from game thread" (https://github.com/landelare/ue5coro/blob/master/Plugins/UE5Coro/Source/UE5Coro/Private/LatentPromise.cpp#L203-L217)

The comment in FLatentPromise::Resume says "If ownership is borrowed, let the guaranteed future Resume call handle this" but there is not guaranteed to be another Resume call.

I don't fully understand the meaning of "detached from game thread" and why every Latent promise is marked as such, because of that I'm not sure if this is really a bug or intended behavior. The documentation specifically calls out delegate awaiters as suffering from this issue, but also other awaiters (in fact, all async awaiters?) suffer from this issue as well. There is a workaround described for delegates specifically (which doesn't let you capture the delegates arguments, so is strictly not a replacement) but no such workaround for the general case.

Is it dangerous to destroy the coroutine right away, i.e. by not calling DetachFromGameThread in the first place or by calling AttachToGameThread from ~FPendingLatentCoroutine?

landelare commented 1 month ago

The latent action being destroyed sends a forced cancellation request to the coroutine, but cancellations are processed when the coroutine would resume. Your coroutine is on "step 5" in FTimerThread, which will deliver that guaranteed Resume call... 100000 seconds later, tomorrow.

I'd suggest either reducing the time to something more reasonable, or replacing Async::PlatformSeconds with Latent::RealSeconds if you need such a long, but cancelable wait. FLatentAwaiters process cancellation on every tick. Alternatively, you could run a loop of 100000 1-second awaits to check for cancellations more often, which should still have fairly low overhead.

I'm not entirely happy with how FForceLatentCoroutine works in 1.x, so I'm considering this an open bug regardless. You got lucky in this instance because your coroutine is a nonstatic member on a UCLASS, but there are genuine cases where coroutine lifetime gets promoted to match your UWorld's, which could amount to a leak in practice if it was the persistent world. 2.0 will deliberately break these silent surprises and require callers to be explicit with their intentions, fixing the underlying problem.

To avoid this situation in 1.x, make sure that the coroutine's owning object and world context are the same and it's passed as the first parameter in static UFUNCTIONs (non-statics will use this), which will continue working in 2.0 without any additional changes. This matches how most latent UFUNCTIONs in the engine itself are written. You can also run in full async mode and keep a coroutine-local TWeakObjectPtr or TStrongObjectPtr to this that you capture before the first co_await, so that you're in full control of lifetime and ownership.

I don't fully understand the meaning of "detached from game thread"

In a nutshell, this is my solution to reconcile C++ object lifetimes (e.g., RAII within a coroutine body, ON_SCOPE_EXIT in your case) with the latent action manager's behavior. A latent coroutine awaiting a TAwaiter transfers coroutine lifetime from the latent action to the awaiter, with the implicit assumption that it will return it to the FLatentPromise later.

UE5Coro 2.0 will have more documentation around these internal mechanisms, intended for maintainers of forks or people who are comfortable with using the private/unstable API.

Is it dangerous to destroy the coroutine right away

Yes. You'll need to do something about callbacks wanting to talk to a deleted promise.

yuriy0 commented 1 month ago

The latent action being destroyed sends a forced cancellation request to the coroutine, but cancellations are processed when the coroutine would resume.

Thank you for your detailed reply. I think this has confirmed my understanding that what I want is not presently possible - that is, to delete the entire coroutine when the "owning" FPendingLatentCoroutine detects its owner is gone via NotifyObjectDestroyed.

You can also run in full async mode and keep a coroutine-local TWeakObjectPtr or TStrongObjectPtr to this that you capture before the first co_await, so that you're in full control of lifetime and ownership.

Yes, I've been trying this, but it gets quite clunky, as I'm basically wrapping every single co_await in a check to determine whether or not we have been cancelled. This also means making them non blueprint useable, at least as I understand, "full async" means not exposed to BP (or if exposed, the BP latent function just fires immediately instead of when the coroutine completes). Awaiting TPromise helps in some cases too because that asserts if its destroyed without being completed, but that doesn't solve the problem, just makes it easier to spot.

A latent coroutine awaiting a TAwaiter transfers coroutine lifetime from the latent action to the awaiter, with the implicit assumption that it will return it to the FLatentPromise later.

I think that some of my functionality challenges this assumption because some awaiters are never going to resume, and that fact is hard to see from the calling code.

Yes. You'll need to do something about callbacks wanting to talk to a deleted promise.

I've tried to start replacing uses of FPromise& in awaiters with a checked pointer type which only operate on the promise handle while holding the lock in FPromiseExtras so that I can delete promises and later check if they're still alive. This includes tracking that the promise is not resumed or done since it seems std::coroutine_handle doesn't provide this functionality.

landelare commented 1 month ago

If you're making custom awaiters using the private API, and you're mainly targeting latent coroutines, I'd suggest basing them on FLatentAwaiter.

landelare commented 1 week ago

In the 2.0 preview, the other issue that I mentioned that might leak coroutines is fixed.

Your particular scenario works as intended in both 1.x and 2.0, although I agree that it is not ideal. In 2.x, some awaiters might see cancellation improvements (no promises!) but this cannot be solved universally without massively increasing the overhead of using awaiters in some cases.

In 2.0, using co_await Latent::Seconds(100000); will provide one-tick cancellation behavior, regardless of coroutine execution mode.