temporalio / sdk-go

Temporal Go SDK
https://docs.temporal.io/application-development?lang=go
MIT License
502 stars 204 forks source link

Unknown NonDeterministicError #1554

Closed vl4deee11 closed 5 days ago

vl4deee11 commented 1 month ago

Expected Behavior

No NonDeterministicError

Actual Behavior

NonDeterministicError

I have a workflow that works great and at the end of the workflow there is a timer that does a simple thing, look at the code below:

func CourierWorkflow(ctx workflow.Context, order *models.Order) error {
    wg := workflow.NewWaitGroup(ctx)

    // ....
    var statusUpdatedChan = workflow.NewChannel(ctx)
    err = workflow.SetUpdateHandler(ctx, StatusUpdateHandler, onStatusUpdate(ctx, order, wg, statusUpdatedChan))
    if err != nil {
        logger.Error("SetUpdateHandler failed", "Error", err)
        return err
    }

    // ....

    selector.AddReceive(statusUpdatedChan, func(c workflow.ReceiveChannel, _ bool) { c.Receive(ctx, &err) })

    doneTimeFuture := workflow.NewTimer(ctx, 48*time.Hour)
    selector.AddFuture(doneTimeFuture, func(_ workflow.Future) {
        order.Status = models.StatusDone
    })

    for {
        selector.Select(ctx)
        if order.Status == models.StatusDone {
            break
        }
        // another actions
    }
}

func onStatusUpdate(ctx workflow.Context, order *models.Order, wg workflow.WaitGroup, statusUpdatedChan workflow.Channel) func(ctx workflow.Context, signal models.OrderSignal) (*models.Order, error) {
    logger := workflow.GetLogger(ctx)
    return func(ctx workflow.Context, signal models.OrderSignal) (*models.Order, error) {
        logger.Info("onStatusUpdate: got new update", "Status", signal.Status, "PrevStatus", order.Status)
        orderStatus := order.Status
        switch orderStatus {
        //... update order data
        default:
            return order, nil
        }

        statusUpdatedChan.Send(ctx, afterStatusUpdate(ctx, order, wg, nil))
        return order, nil
    }
}

func afterStatusUpdate(ctx workflow.Context, order *models.Order, wg workflow.WaitGroup, canceledStatus *models.Status) error {
    switch order.Status {
    case models.StatusAtVendor, models.StatusInDelivery, models.StatusAtClient, models.StatusDelivered:
        // -> PANIC HERE, code works from 2023 with no problem
        err := activities.FinishCourierStepForOrderExecute(ctx, order)
        if err != nil {
            return err
        }
    case models.StatusCanceled:
        activities.SkipCourierStepsForOrderExecute(ctx, order)
    }
    return nil
}

And with this timer, everything is fine in 99.9% of cases, but sometimes a non-determinism error crashes, and the replays go normal on the same workflow on which the non-determinism error occurs

Replay works great, no problem with workflows

Stack trace:

process event for courier [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_command_state_machine.go:455
go.temporal.io/sdk/internal.(*commandsHelper).handleActivityTaskScheduled(0xc009dfc3c0, {0xc00d45cbfc, 0x3}, 0x0?)
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_command_state_machine.go:1038 +0x109
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc00cf42d98, 0xc00de35400, 0x0?, 0x0)
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_event_handlers.go:1031 +0x24a
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0098de900, 0xc00b29f170)
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_task_handlers.go:960 +0x11bf
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000166340, 0xc00b29f170, 0xc006727740)
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_task_handlers.go:778 +0x4f6
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000984400, 0xc00b29f170)
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_task_pollers.go:329 +0x2dd
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000984400, {0x157ff00?, 0xc00b29f170})
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_task_pollers.go:302 +0x6c
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc0001ba280, {0x157fa80?, 0xc0081f2330})
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_worker_base.go:440 +0x167
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
    /go/src/git.uzum.io/foodtech/develop/operations/logistics-orders/vendor/go.temporal.io/sdk/internal/internal_worker_base.go:334 +0xb5

Steps to Reproduce the Problem

  1. -

Specifications

Quinn-With-Two-Ns commented 1 month ago

Can you try downloading the history for these failing workflows and testing with the replayer and see if the error reproduces?

https://docs.temporal.io/develop/go/testing-suite#replay

I also notice you using SetUpdateHandler on a vey old SDK and an older Server release, Workflow Update is an experimental feature and numerous bug fixes and API changes have been made since that release so it is possible it is related to workflow update and upgrading would fix it.

vl4deee11 commented 1 month ago

@Quinn-With-Two-Ns Hello, as I wrote earlier, the replay is successful, without errors, and the only version that I can update the sdk to is 1.24.1. SetUpdateHandler I took from the documentation here https://docs.temporal.io/develop/go/message-passing#handle-update

Quinn-With-Two-Ns commented 1 month ago

The latest Go SDK is v1.28.1 there are multiple bugs on fixed on the Server and SDK that could cause this behaviour your seeing. I would recommend upgrading your SDK and Server to the latest stable release.

vl4deee11 commented 1 month ago

I found one bug in sdk in version 1.24.1, now I will fix it at the level of my code, because it is very difficult to update the sdk and the temporal version in the prod, given that the latest updates did not go very smoothly