nforgeio / neonKUBE

Public NeonKUBE Kubernetes distribution related projects
https://neonkube.io
Apache License 2.0
78 stars 13 forks source link

WorkflowParallelOperationException #775

Closed marcusbooyah closed 4 years ago

marcusbooyah commented 4 years ago

Ok, I've found a way to reliable reproduce this. It seems like the last workflow standing will complete successfully, but all other concurrent workflows will fail until there is just 1 left.

Here's the workflow:

[WorkflowMethod(WorkflowIdReusePolicy = WorkflowIdReusePolicy.RejectDuplicate, Name = "process-order")]
public async Task<int> ProcessOrderAsync(Order order)
{ 
    // order activity
    orderActivity = Workflow.NewLocalActivityStub<IActivityOrder, ActivityOrder>(
        new LocalActivityOptions()
        {
            ScheduleToCloseTimeout = TimeSpan.FromHours(1),
            RetryOptions = new RetryOptions()
            {
                MaximumAttempts = 10,
                BackoffCoefficient = 2.0,
                InitialInterval = TimeSpan.FromSeconds(1),
                MaximumInterval = TimeSpan.FromSeconds(30),
                NonRetriableErrors = new List<string>()
            }
        });

    for (int i = 0; i < await Workflow.NextRandomAsync(1000); i++)
    {
        await orderActivity.SleepAsync(order.Id);
    }

    return 0;
}

[ActivityMethod(Name = "sleeparoo")]
public async Task SleepAsync(string id)
{
    await Task.Delay(10);
}
jefflill commented 4 years ago

This is probably the true scenario that was causing #773 too.

jefflill commented 4 years ago

This may be a problem with cadence-proxy.

The .NET side is seeing WorkflowInvokeRequest messages from cadence-proxy with the same ContextId for different workflows. I poked around the GO code a bit and see that @johncburns1 is protecting NextContextId() with a mutex, but perhaps your not using this everywhere.

I've recreated the neonKUBE cadence branch and have added some temporary debug logging. The log output will be appended to C:\temp\cadence-debug.log. I generally open this in Notepad++ and clear and save this file before each run.

Here's what I'm seeing in this log when I run Marcus' workflow multiple times:

2020-03-10T10:18:35.738Z: Workflow Invoke Message: workflowId=wf-stripe-payment:b0c71f58-750c-49c2-babf-587ff4ac8714 contextId=1
2020-03-10T10:18:37.192Z: Workflow Invoke Message: workflowId=wf-customer-app-order:b0c71f58-750c-49c2-babf-587ff4ac8714 contextId=1
2020-03-10T10:18:39.856Z: Workflow Invoke Message: workflowId=wf-stripe-payment:7302e95d-5543-4224-b585-7d9b1b53a9d9 contextId=2
2020-03-10T10:18:41.131Z: Workflow Invoke Message: workflowId=wf-customer-app-order:7302e95d-5543-4224-b585-7d9b1b53a9d9 contextId=2
2020-03-10T10:18:42.092Z: ******* Parallel Execution: workflowId=wf-customer-app-order:7302e95d-5543-4224-b585-7d9b1b53a9d9 contextId=2

BTW: I discovered a cool way to debug packages referenced by other projects without having to publish nuget packages with new version numbers etc. The trick is to update the DLL (and PDB) you're modifying in your machines nuget package cache.

In this case, I added the debug logging to the Neon.Cadence code, rebuilt the library and then copied these files:

C:\src\neonKUBE\Lib\Neon.Cadence\bin\Debug\netstandard2.0
   Neon.Cadence.dll
   Neon.Cadence.pdb

to

C:\Users\jeff\.nuget\packages\neon.cadence\1.2.2\lib\netstandard2.0
   Neon.Cadence.dll
   Neon.Cadence.pdb

Then all you need to do is rebuild the project referencing the package and it will include the new library bits. The extra cool thing is that you can set breakpoints in the library code. I think this works when the library was rebuilt on the same machine your debugging on.

When you're done debugging, you should remove this package from the nuget cache and then restore packages to get back to the official published package bits.


FOLLOW-UP: This ended up being a non issue. Jack realized that the test was running multiple workflow workers and they all were logging to the same file. So, we'd expect each of these workers to maintain their own set of local IDs and thus duplicate IDs across all three is to be expected. Note that the last two Workflow Invoke Messages have different workflow IDs but share the same contextId=2. This results in both workflows sharing the same context which is really bad and is resulting in the parallel operation exception because the two workflows are in fact performing parallel operations on the same workflow context.

johncburns1 commented 4 years ago

Some interesting results from exploration today:

There were some other interesting findings when I ran some of these tests. In some cases other errors were thrown in addition to the workflow parallel operations error. The first 3 panics were thrown on the same run and the last error, EntityNotExists, is thrown on almost every run when you continue to submit orders after the parallel operation exception has been thrown.

Illegal Access Outside of Workflow Context

Neon.Cadence.CadenceCustomException: 'Panic: getState: illegal access from outside of workflow context, MessageType: ActivityExecuteLocalRequest, RequestId: 485, ClientId: 6. goroutine 3498 [running]:
runtime/debug.Stack(0xc0004f05a0, 0x6, 0x1b)
    c:/go/src/runtime/debug/stack.go:24 +0xa4
github.com/cadence-proxy/internal/endpoints.handleIProxyRequest.func1(0xfba180, 0xc0005fe030, 0xc0007cfe00, 0xc0007cfe40)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:152 +0x141
panic(0xcbac00, 0xf80a60)
    c:/go/src/runtime/panic.go:679 +0x1c0
go.uber.org/cadence/internal.getState(0xfa71a0, 0xc0005fa3f0, 0xc0007cfa78)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_workflow.go:529 +0xa4
go.uber.org/cadence/internal.NewChannel(0xfa71a0, 0xc0005fa3f0, 0x10, 0xc0002fee20)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/workflow.go:278 +0x40
go.uber.org/cadence/internal.newDecodeFuture(...)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_workflow.go:1272
go.uber.org/cadence/internal.ExecuteLocalActivity(0xfa71a0, 0xc0005fa3f0, 0xd07ca0, 0xc0005a45e0, 0xc0002fee20, 0x1, 0x1, 0x5, 0x1d3)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/workflow.go:467 +0x54
go.uber.org/cadence/workflow.ExecuteLocalActivity(...)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/workflow/workflow.go:157
github.com/cadence-proxy/internal/endpoints.handleActivityExecuteLocalRequest(0xfa6ae0, 0xc000032170, 0xc0005fe030, 0x4, 0xf7fe60)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/request_handlers.go:2549 +0x774
github.com/cadence-proxy/internal/endpoints.handleIProxyRequest(0xfba180, 0xc0005fe030, 0x0, 0x0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:486 +0x1a4e
github.com/cadence-proxy/internal/endpoints.processIncomingMessage(0xfb2540, 0xc0005fe030, 0xc0005cc120)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:116 +0x100
created by github.com/cadence-proxy/internal/endpoints.MessageHandler
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:69 +0x146

Trying to Block on Couroutine Which is Already Blocked

Neon.Cadence.CadenceCustomException: 'Panic: trying to block on coroutine which is already blocked, most likely a wrong Context is used to do blocking call (like Future.Get() or Channel.Receive(), MessageType: ActivityExecuteLocalRequest, RequestId: 566, ClientId: 6. goroutine 4330 [running]:
runtime/debug.Stack(0xc000693050, 0x6, 0x1b)
    c:/go/src/runtime/debug/stack.go:24 +0xa4
github.com/cadence-proxy/internal/endpoints.handleIProxyRequest.func1(0xfba180, 0xc0002c8168, 0xc000727e00, 0xc000727e40)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:152 +0x141
panic(0xcbac00, 0xf80aa0)
    c:/go/src/runtime/panic.go:679 +0x1c0
go.uber.org/cadence/internal.(*coroutineState).initialYield(0xc0003ad300, 0x3, 0xc0002cab80, 0x19)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_workflow.go:742 +0xc9
go.uber.org/cadence/internal.(*coroutineState).yield(...)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_workflow.go:757
go.uber.org/cadence/internal.(*channelImpl).Receive(0xc000867c20, 0xfa71a0, 0xc00081ebd0, 0x0, 0x0, 0x0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_workflow.go:572 +0x2a4
go.uber.org/cadence/internal.(*decodeFutureImpl).Get(0xc0004b1600, 0xfa71a0, 0xc00081ebd0, 0xc8d120, 0xc0004b1640, 0x1, 0x1)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_workflow.go:1231 +0x70
github.com/cadence-proxy/internal/endpoints.handleActivityExecuteLocalRequest(0xfa6ae0, 0xc000032170, 0xc0002c8168, 0x4, 0xf7fe60)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/request_handlers.go:2557 +0x7e7
github.com/cadence-proxy/internal/endpoints.handleIProxyRequest(0xfba180, 0xc0002c8168, 0x0, 0x0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:486 +0x1a4e
github.com/cadence-proxy/internal/endpoints.processIncomingMessage(0xfb2540, 0xc0002c8168, 0xc0005d6fc0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:116 +0x100
created by github.com/cadence-proxy/internal/endpoints.MessageHandler
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:69 +0x146

No Cached Result Found for Side Effect

Neon.Cadence.CadenceCustomException: 'Panic: No cached result found for side effectID=1. KnownSideEffects=[0], MessageType: WorkflowMutableRequest, RequestId: 743, ClientId: 6. goroutine 6049 [running]:
runtime/debug.Stack(0xc000647350, 0x6, 0x16)
    c:/go/src/runtime/debug/stack.go:24 +0xa4
github.com/cadence-proxy/internal/endpoints.handleIProxyRequest.func1(0xfbbb20, 0xc0004681e8, 0xc000491e00, 0xc000491e40)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:152 +0x141
panic(0xcbac00, 0xc00064be50)
    c:/go/src/runtime/panic.go:679 +0x1c0
go.uber.org/cadence/internal.(*workflowEnvironmentImpl).SideEffect(0xc000562e00, 0xc0004ec990, 0xc0004ec9c0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/internal_event_handlers.go:593 +0x86a
go.uber.org/cadence/internal.SideEffect(0xfa6fa0, 0xc0002d1900, 0xc00064be40, 0x9, 0xc0003ab0d0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/internal/workflow.go:1022 +0x24f
go.uber.org/cadence/workflow.SideEffect(...)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/vendor/go.uber.org/cadence/workflow/workflow.go:261
github.com/cadence-proxy/internal/endpoints.handleWorkflowMutableRequest(0xfa6ae0, 0xc000032170, 0xc0004681e8, 0x4, 0xf7fe60)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/request_handlers.go:935 +0x6ea
github.com/cadence-proxy/internal/endpoints.handleIProxyRequest(0xfbbb20, 0xc0004681e8, 0x0, 0x0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:321 +0x108d
github.com/cadence-proxy/internal/endpoints.processIncomingMessage(0xfb52c0, 0xc0004681e8, 0xc0008acae0)
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:116 +0x100
created by github.com/cadence-proxy/internal/endpoints.MessageHandler
    C:/Users/johnc/source/repos/neonKUBE/Go/src/github.com/cadence-proxy/internal/endpoints/message.go:69 +0x146

Entity Not Exists

[2020-03-11T01:19:31.569+00:00] [ERROR] [module:Neon.Cadence.WorkflowBase] [index:16] Neon.Cadence.EntityNotExistsException: EntityNotExistsError\n   
    at Neon.Cadence.Internal.ProxyReply.ThrowOnError()\n   
    at Neon.Cadence.Workflow.ExecuteLocalActivityAsync(Type activityType, ConstructorInfo activityConstructor, MethodInfo activityMethod, Byte[] args, LocalActivityOptions options)\n   
    at Neon.Cadence.Stubs.ActivityOrder_Stub_14.___StubHelper.ExecuteLocalActivityAsync(Workflow workflow, Type activityType, ConstructorInfo constructor, MethodInfo method, Byte[] args, LocalActivityOptions options)\n   
    at Neon.Cadence.Stubs.ActivityOrder_Stub_14.SleepAsync()\n   
    at WFCustomerAppOrder.WorkflowOrder.ProcessOrderAsync(Order order) in C:\\Users\\johnc\\source\\repos\\loopie\\Workflows\\wf-customer-app-order\\WorkflowOrder.cs:line 101\n   
    at Neon.Common.NeonHelper.GetTaskResultAsObjectAsync(Task task)\n   at Neon.Cadence.WorkflowBase.OnInvokeAsync(CadenceClient client, WorkflowInvokeRequest request)
jefflill commented 4 years ago

@johncburns1 figured this out. This ended up being due to the workflow saving activity stubs as static rather than instance properties. When the second workflow started, it would create new stubs, save them to the static properties, which would break these stubs when called from other workflow instances.

This is not a CadenceClient bug but perhaps we could figure out a way to throw a more informative exception.