temporalio / sdk-typescript

Temporal TypeScript SDK
Other
513 stars 100 forks source link

[Bug] Worker crashes with "async hook stack has become corrupted" on Workflow Task timeout #939

Open thehellmaker opened 1 year ago

thehellmaker commented 1 year ago

What are you really trying to do?

Tell us at a high level what you're doing, to avoid XY problem (https://en.wikipedia.org/wiki/XY_problem) We had a workflow in development where we were still developing on it. Older version of workflow is still running when we change the code of workflow and deployed it and we see a crash in the system

Describe the bug

Error: async hook stack has become corrupted (actual: 112970, expected: 3) 1: 0x10229dab8 node::AsyncHooks::FailWithCorruptedAsyncStack(double) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 2: 0x102232164 node::AsyncHooks::pop_async_context(double) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 3: 0x102231d8c node::InternalCallbackScope::Close() [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 4: 0x10223266e node::InternalMakeCallback(node::Environment, v8::Local, v8::Local, v8::Local, int, v8::Local, node::async_context) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 5: 0x10224a238 node::AsyncWrap::MakeCallback(v8::Local, int, v8::Local) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 6: 0x102346af6 node::worker::MessagePort::OnMessage(node::worker::MessagePort::MessageProcessingMode) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 7: 0x102cd275b uv__async_io [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 8: 0x102ce654b uv__io_poll [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 9: 0x102cd2ce1 uv_run [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 10: 0x102232f9f node::SpinEventLoop(node::Environment) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 11: 0x1023a9cae node::worker::Worker::Run() [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 12: 0x1023ad6a2 node::worker::Worker::StartThread(v8::FunctionCallbackInfo const&)::$_3::__invoke(void*) [/Users/random/.nvm/versions/node/v16.16.0/bin/node] 13: 0x7ff81c0e84e1 _pthread_start [/usr/lib/system/libsystem_pthread.dylib] 14: 0x7ff81c0e3f6b thread_start [/usr/lib/system/libsystem_pthread.dylib] [nodemon] app crashed - waiting for file changes before starting... ^C

bergundy commented 1 year ago

Interesting.. this looks like a Node.js issue, can you try running with different Node.js versions?

thehellmaker commented 1 year ago

Actually i figured out the issue. It was a bug in the worker code which lead to an infinite loop. But this should be handled better by temporal i feel with proper error messaging

My worker has a code like this

while(true) {
        // FIX: currentState = WorkflowState.COMPLETE
        await condition(() => currentState !== WorkflowState.COMPLETE);
        if (currentState === WorkflowState.REVIEWED_AND_SAVED) {
            // call an activity and
            return response
        }
}

The commented out line with // FIX: is the fix

The problem here is

bergundy commented 1 year ago

I don't know how much we can handle this since it looks like the error happens deep in Node internals. I can see why this happens though. Thanks for the report.

The SDK does protect you from tight loops by setting a timeout on workflow activations but looks like something else is going on here.

It looks like async_hooks and the vm context execution timeout clash. We should try to reproduce this and see if there's anything the SDK can do or to mitigate this but I suspect the issue is an unhandled edge case in Node.js.

mjameswh commented 1 year ago

Here is a repro without condition:

// Fails with corrupted async stack
export async function testWorkflow(): Promise<void> {
  while (true) {
    await new CancellationScope({ cancellable: true }).run(() =>
      Promise.resolve(true)
    );
  }
}

This is obviously an endless loop. I'd expect the WFT to ends with "Error: Script execution timed out after 10000ms", but instead get an async hook stack corruption, as described above.

My investigation got it down to an issue in node involving: vm + microtaskMode: 'afterEvaluate' + runInContext with timeout + AsyncLocalStorage. I know for sure that it happens after the runInContext’s timeout fires. Seems like either the AsyncLocalStorage internal state is not properly cleaned up on context timeout, or there are still some live functions (probably native code) attached to that context's AsyncLocalStorage that tries to access the storage after it has been cleaned up.

Proposed solution is to replace Node's builtin but experimental implementation of AsyncLocalStorage with our own implementation.