temporalio / sdk-typescript

Temporal TypeScript SDK
Other
496 stars 93 forks source link

[Bug] Unexpected behaviour related to CancellationScope.nonCancellable #1423

Open zijianzz opened 1 month ago

zijianzz commented 1 month ago

What are you really trying to do?

Use condition with timeout inside nonCancellable scope, and wait a signal to unlock condition await

Describe the bug

Hello, I recently saw a weird behaviour which I don’t know if it’s normal or not because I didn’t find any documentation related to this subject. Here is a test code I used to reproduce this issue

export async function runCancellableActivity(): Promise<void> {
  try {
    await sleep(1000000);
  } catch (err) {
    await CancellationScope.nonCancellable(async () => {
      try {
        console.log('condition');
        await condition(() => false, 10000); // worked good for => await condition(() => false);
      } catch (error) {
        console.log('still error', error);
      }
    });
    throw err;
  }
}

Minimal Reproduction

my condition is wrapped inside nonCancellable scope, if I understood well, CancelledFailure should never propagated into this scope but it’s not always the case, if I decide to give a timeout to my condition. my app did printed still error on console with CancelledFailure exception.

Environment/Versions

Additional context

bergundy commented 1 month ago

What Node version?

zijianzz commented 1 month ago

What Node version?

I tested in both 16 and 18, I got the same result

cirias commented 1 month ago

I run into the same issue. This is because the timeout feature in condition is implemented through a cancellable CancellationScope. And a cancellable CancellationScope cancels when any of its parent scope cancels, even when the parent is nonCancellable.

The problem is why a scope could still be cancelled even when it's inside a nonCancellable scope. Intuitively a cancellable scope in a nonCancellable scope should only be cancelled by call cancel method directly on the scope object.

My guess is, when the root scope is cancelled, the workflow should propagates the cancel to all cancellable scope even it's in a nonCancellable scope. But we probably need an option to disable this feature.

cirias commented 1 month ago

@zijianzz I've been using this hack function as a walk-around.

async function conditionTimeout(fn: () => boolean, timeout: Duration): Promise<void> {
  // here should check all parents is cancellable
  // in my use case this is enough
  if (CancellationScope.current().cancellable) {
    return condition(fn, timeout).then((pass) => {
      if (!pass) {
        throw ApplicationFailure.nonRetryable('condition timed out');
      }
    });
  }

  return Promise.race([
    // it's impossible to cancel this sleep later after the race resolved,
    // because if as long as you wrap it in a cancellable scope, it cancels immediately 
    // when any of the parent scopes cancelled, 
    // making it the same behavior as the original condition function.
    sleep(timeout).then(() => {
      throw ApplicationFailure.nonRetryable('condition timed out');
    }),
    condition(fn),
  ]);
}
mjameswh commented 1 month ago

Thanks for the repro. It looks like the behaviour for the cancelable/non-cancellable/cancellable case was never clearly defined, and the effective behaviour is definitively problematic. We'll discuss that internally.

bergundy commented 1 month ago

I don't think it was not clearly defined, just not properly implemented. We'll need to fix this with a patch that ensures we don't break history compatibility.

cirias commented 1 month ago

Thanks @mjameswh @bergundy ! While you discuss the cancellable/non-cancellable cases, please also consider the case of activity cancellation. Currently, to cancel an activity can only be achieved by calling it inside a cancelablescope. Given the current behavior of "cancel on root cancels all", we can't create a cancelable activity when the workflow is cancelled. Because the activity wrapped in a cancelablescope will be cancelled/skipped immediately.

In general, the problem is the during the cleanup process, when a workflow is cancelled, we still need the ability to control when to cancel some execution. For example, to use condition with timeout during the cleanup, or call an activity but also want to have the ability to cancel it later during the cleanup. Right now, since it cancels everything, we lose the control of cancellation.