temporalio / sdk-go

Temporal Go SDK
https://docs.temporal.io/application-development?lang=go
MIT License
528 stars 210 forks source link

Specifying StartToCloseTimeout but no ScheduleToStartTimeout or ScheduleToCloseTimeout does not return an error #675

Closed tminusplus closed 1 year ago

tminusplus commented 2 years ago

Expected Behavior

Executing an activity with a defined StartToCloseTimeout, an undefined ScheduleToStartTimeout, and an undefined ScheduleToCloseTimeout raises an error.

Originally, we expected the workflow to continue retrying forever - but closer reading of the docs revealed that you must set either ScheduleToStartTimeout or ScheduleToCloseTimeout here. The doc can be confusing, because you have to read all three timeouts to know what must be set, if I understand it correctly.

It is possible that this behavior is expected, and just not clearly defined within the docs. I tried looking in the Go SDK docs and on temporal.io. Please let me know if that is the case! The fix is easy on our side but I wanted to report a bug as the behaviour surprised us.

Actual Behavior

Executing an activity with a defined StartToCloseTimeout, an undefined ScheduleToStartTimeout, and an undefined ScheduleToCloseTimeout does not raise an error and runs normally. Then if a WorkflowTaskTimedOut occurs, we observed this with a ScheduleToStart timeout type, the workflow task will fail and retryState will be set to RetryPolicyNotSet.

Steps to Reproduce the Problem

  1. Execute a workflow with default WorkflowOptions.
  2. In that workflow, execute an activity with only StartToCloseTimeout specified in ActivityOptions.
  3. Return an error immediately within the activity task.
  4. Spam the worker with these tasks so that one times out due to ScheduleToStartTimeout.
  5. Once the workflow task times out due to ScheduleToStart, the workflow execution will fail.

Specifications

Notes

I wanted to provide some more info on this bug, so it is easier to reproduce. I am unable to share the full workflow logs, but can share some of the non-sensitive parts.

Workflow task started with:

        "workflowExecutionTimeout": "0s",
        "workflowRunTimeout": "0s",
        "workflowTaskTimeout": "10s",
        "attempt": 1,
        "cronSchedule": "",

Activity task started with:

        "scheduleToCloseTimeout": "0s",
        "scheduleToStartTimeout": "0s",
        "startToCloseTimeout": "5s",
        "heartbeatTimeout": "0s",
        "retryPolicy": {
          "nonRetryableErrorTypes": [],
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0
        }

Activity task failed with this, which is the error I was throwing to test live failures:

        "lastFailure": {
          "message": "testing job failure for metrics dashboard",
          "source": "GoSDK",
          "stackTrace": "",
          "applicationFailureInfo": {
            "type": "",
            "nonRetryable": false
          },
          "failureInfo": "applicationFailureInfo"
        }

Workflow task scheduled as sticky:

      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          ...
          "kind": "Sticky"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }

Then it timed out:

      "eventType": "WorkflowTaskTimedOut",
      "workflowTaskTimedOutEventAttributes": {
        "scheduledEventId": "8",
        "startedEventId": "0",
        "timeoutType": "ScheduleToStart"
      }

And finally, it failed:

      "workflowExecutionFailedEventAttributes": {
        "failure": {
          "message": "testing failure for metrics",
          "source": "GoSDK",
          "stackTrace": "",
          "applicationFailureInfo": {
            "type": "",
            "nonRetryable": false
          }
        },
        "retryState": "RetryPolicyNotSet",
      }

I tried root causing this, and I think it might be caused around here https://github.com/temporalio/temporal/blob/3de5687db2eda2ad8367beaff4c26e439c84f62c/service/history/commandChecker.go#L306

cretz commented 2 years ago

The title says:

Specifying StartToCloseTimeout but no ScheduleToStartTimeout or ScheduleToCloseTimeout does not return an error

It is possible that this behavior is expected, and just not clearly defined within the docs.

Definitely expected behavior (and is expected across all SDKs, as it is enforced server side). Simply: At least one of ScheduleToCloseTimeout or StartToCloseTimeout is required. From the Godoc on ActivityOptions you linked (other stuff left off for clarity):

    // Either this option or StartToClose is required: Defaults to unlimited.
    ScheduleToCloseTimeout time.Duration

    // If ScheduleToClose is not provided then this timeout is required: Defaults to the ScheduleToCloseTimeout value.
    StartToCloseTimeout time.Duration

This is also documented at https://docs.temporal.io/docs/go/how-to-set-activityoptions-in-go. We would be happy to make this clearer in any way you can think of. Suggestions?

tminusplus commented 2 years ago

That makes sense, thank you for the clarification. The part that was confusing was these three comments from the godoc:

    // ...
    // Either this option or StartToClose is required: Defaults to unlimited.
    ScheduleToCloseTimeout time.Duration

    // ...
    // If ScheduleToClose is not provided then this timeout is required.
    // Optional: Defaults to unlimited.
    ScheduleToStartTimeout time.Duration

    // ...
    // If ScheduleToClose is not provided then this timeout is required: Defaults to the ScheduleToCloseTimeout value.
    StartToCloseTimeout time.Duration

It gave me the impression that if ScheduleToClose is not provided then both StartToClose and ScheduleToStart are required. Perhaps a potential improvement might be:

    // ...
    // If StartToClose is not provided then this timeout or ScheduleToStart are required. 
        // Defaults to unlimited.
    ScheduleToCloseTimeout time.Duration

    // ...
    // If StartToClose is not provided then this timeout or ScheduleToClose are required.
    // Optional: Defaults to unlimited.
    ScheduleToStartTimeout time.Duration

    // ...
    // If ScheduleToClose or ScheduleToStart are not provided then this timeout is required.
        // Defaults to the ScheduleToCloseTimeout value.
    StartToCloseTimeout time.Duration
cretz commented 2 years ago
    // If StartToClose is not provided then this timeout or ScheduleToStart are required. 
        // Defaults to unlimited.
    ScheduleToCloseTimeout time.Duration

I am not sure this is accurate. You must have at least one of ScheduleToCloseTimeout or StartToCloseTimeout timeout set. That requirement is unrelated to the ScheduleToStartTimeout requirement (at least from my reading, I may have to go confirm behavior).