temporalio / sdk-typescript

Temporal TypeScript SDK
Other
519 stars 103 forks source link

[Bug] `reuseV8Context` breaks workflow isolation #1355

Closed neelance closed 7 months ago

neelance commented 7 months ago

We just ran into the issue that v1.9.0 broke our workflow code. The issue is that we were using a global variable (global = at the top-level of the TypeScript file) to collect cleanup steps that were supposed to get executed at the end of the workflow. Prior to v1.9.0 this worked fine, because each workflow run had its own set of global variables.

With reuseV8Context however it caused the cleanup steps to leak into other workflow runs. If workflow run X added a cleanup step and went to sleep and then another workflow run Y got executed with the same V8 context, then Y would also invoke the cleanup steps of X.

This issue was hard to notice during testing. Fortunately we were able to recover with no long-lasting impact.

It seems like a footgun that it is so easy to break isolation now. One improvement would be to make AsyncLocalStorage available to the workflow code. I wanted to use if back when I implemented the cleanup logic, but it was not available so I opted for the global variable instead.

bergundy commented 7 months ago

Thanks for reporting.

Can you post a repro?

You can use workflow specific AsyncLocalStorage.

neelance commented 7 months ago

I did not know that you are exposing AsyncLocalStorage via @temporalio/workflow. This is perfect and the proper way to implement it. Thanks!

bergundy commented 7 months ago

@neelance I'm curious how reuseV8Context broke isolation for you. Could you please provide more details?

neelance commented 7 months ago

Here is the concept:

const someList: string[] = [];

export async function testWorkflow() {
   someList.push(workflowInfo().workflowId);
   sleep(10_000);
   await activities.printList(someList);
}

I think you should be able to observe entries from other workflows when you invoke it multiple times.

bergundy commented 7 months ago

I can't reproduce this:

https://github.com/temporalio/sdk-typescript/pull/1356/files

neelance commented 7 months ago

Strange. I just reproduced it like this:

let value = 0;
export async function example(): Promise<number> {
  value++;
  return value;
}

Every time I run this workflow, I get the next number until I restart the worker. When I disable reuseV8Context I always get 1. I wonder what's the difference between our setups...

bergundy commented 7 months ago

How are you creating your bundle (ignore if you don't know what I mean) and worker?

neelance commented 7 months ago

Yes, that's exactly what I was just looking at and I think I found it: We're building the workflow bundle on our own and you seem to manipulate the __webpack_module_cache__, which does not seem to work with our bundle. We're still using Webpack, so I guess there should be some way to make it work...

neelance commented 7 months ago
    code = code.replace(
      'var __webpack_module_cache__ = {}',
      'var __webpack_module_cache__ = globalThis.__webpack_module_cache__'
    );

🙈

bergundy commented 7 months ago

I see, well we definitely don't support that. You're on your own there 😬

neelance commented 7 months ago

Problem is that using your bundler comes with inconsistencies to how we bundle the rest of our codebase...

And well, you're not really discouraging it either: CleanShot 2024-02-09 at 00 27 33 CleanShot 2024-02-09 at 00 28 05 https://docs.temporal.io/dev-guide/typescript/foundations

bergundy commented 7 months ago

bundleWorkflowCode is a utility that we maintain and support. It's used internally by the worker if you give it a workflowsPath option.

What are the inconsistencies I wonder. Is using the webpack hook we provide not enough?

bergundy commented 7 months ago

I'll have the docs fixed, don't know who put this advice there.

neelance commented 7 months ago

Is using the webpack hook we provide not enough?

I just tried bundleWorkflowCode and got it to work. We simplified our Webpack setup in the last months, so that helped. Honestly, I'm still not that happy about having a second bundler. I'll think about it some more...

bergundy commented 7 months ago

Thanks for the updates.

Let me know if you have more thoughts on the subject.

neelance commented 7 months ago

Here's an update:

We have switched to using bundleWorkflowCode. It was easy enough to integrate into our current build setup. Especially the Webpack hook helped.

However, we've still found a slightly different workflow isolation bug, which still happens when using your bundler. We were also able to resolve it with AsyncLocalStorage, but you probably want to look into it. Unfortunately it is again difficult for me to come up with a reproduction, so I will describe what we saw and maybe you can already see the issue:

We saw some workflows getting stuck with a nondeterminism error. However, I was able to replay these workflows locally on the same code without any issue. I then guessed that they will probably resolve when restarting the workers. We then restarted all workers without any code change and the workflows completed fine as expected. This strongly suggests an isolation issue.

The difference to the case above is that this time it is not an execution of workflow A leaking into workflow B, but one execution of workflow A leaking into a replay (on the worker) of the same workflow A. In our case signals were involved: The signal handler was writing the data it received into a global variable which served as a map (the data is keyed). Some specific condition then waited for an entry in this map. The nondeterminism was due to code at the beginning of the workflow already reading data from the map that should not yet be available until some later signal in the history of the workflow.

Have I explained it well enough? My best guess is that you are not discarding the Webpack module cache when replaying a workflow.

bergundy commented 7 months ago

Thanks for the update and detailed report.

I wonder what is going on there, I don’t see a reason why this would happen.

Does disabling reuse v8 context also prevent workers getting into this state?