temporalio / temporal

Temporal service
https://docs.temporal.io

Corrupted history event when workflow args are too large #1267

Closed shaunco closed 3 years ago

shaunco commented 3 years ago

Expected Behavior

If ExecuteWorkflow is called with args that serialize to larger than the max history size, a proper error should be returned from ExecuteWorkflow and the workflow history should reflect that error.

Actual Behavior

A workflow execution is created with no history. Subsequent attempts to retrieve the workflow via tctl or temporal-web fail with "Failed to get history on workflow" or "corrupted history event batch, eventID is not continuous", and it appears that subsequent workflow executions in the same namespace (even for different workflows) get stuck behind the now-corrupt workflow.

We haven't yet found a way to clear this history-less execution through the Temporal-provided tools, and end up having to remove it from the database manually.

Some tctl logs:

$ docker-compose run tctl --ns default wf list -op
WORKFLOW TYPE |            WORKFLOW ID               |                RUN ID                |      TASK QUEUE       | START TIME | EXECUTION TIME
    TS.Load   | ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ | 2e30564c-8f4d-45e2-a5ed-d0af34c0a337 | TIMESERIES_TASK_QUEUE | 17:42:22   | 17:42:22

$ docker-compose run tctl --ns default wf show -wid ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ
Error: Failed to get history on workflow id: ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ, run id: .
Error Details: context deadline exceeded
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

$ docker-compose run tctl --ns default admin wf desc -wid ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ
Cache mutable state:
{
  "executionInfo": {
    "namespaceId": "67f5aeeb-0190-43d7-9ecf-1827acc18083",
    "workflowId": "ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ",
    "taskQueue": "TIMESERIES_TASK_QUEUE",
    "workflowTypeName": "TS.Load",
    "workflowExecutionTimeout": "0s",
    "workflowRunTimeout": "0s",
    "defaultWorkflowTaskTimeout": "10s",
    "lastEventTaskId": "1048613",
    "lastFirstEventId": "4",
    "startTime": "2021-02-07T17:42:22.593879300Z",
    "lastUpdateTime": "2021-02-08T15:34:19.889Z",
    "workflowTaskScheduleId": "5",
    "workflowTaskTimeout": "10s",
    "workflowTaskAttempt": 7832,
    "workflowTaskScheduledTime": "2021-02-08T15:34:19.889413600Z",
    "workflowTaskOriginalScheduledTime": "2021-02-08T15:34:19.889412300Z",
    "workflowTaskRequestId": "emptyUuid",
    "stickyScheduleToStartTimeout": "0s",
    "attempt": 1,
    "autoResetPoints": {

    },
    "versionHistories": {
      "histories": [
        {
          "branchToken": "CiQyZTMwNTY0Yy04ZjRkLTQ1ZTItYTVlZC1kMGFmMzRjMGEzMzcSJDY1ZDlhMjlmLTc3NzQtNGFiYS1hYjZjLTdiOWFjYWNiMjM0Nw==",
          "items": [
            {
              "eventId": "4"
            }
          ]
        }
      ]
    },
    "firstExecutionRunId": "2e30564c-8f4d-45e2-a5ed-d0af34c0a337",
    "executionStats": {
      "historySize": "53699"
    },
    "workflowRunExpirationTime": "0001-01-01T00:00:00Z"
  },
  "executionState": {
    "createRequestId": "db985d3e-5659-4c1c-bc68-184858dcb9e7",
    "runId": "2e30564c-8f4d-45e2-a5ed-d0af34c0a337",
    "state": "Running",
    "status": "Running"
  },
  "nextEventId": "5"
}
Database mutable state:
{
  "executionInfo": {
    "namespaceId": "67f5aeeb-0190-43d7-9ecf-1827acc18083",
    "workflowId": "ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ",
    "taskQueue": "TS_TASK_QUEUE",
    "workflowTypeName": "TS.Load",
    "workflowExecutionTimeout": "0s",
    "workflowRunTimeout": "0s",
    "defaultWorkflowTaskTimeout": "10s",
    "lastEventTaskId": "1048613",
    "lastFirstEventId": "4",
    "startTime": "2021-02-07T17:42:22.593879300Z",
    "lastUpdateTime": "2021-02-08T15:34:19.889Z",
    "workflowTaskScheduleId": "5",
    "workflowTaskTimeout": "10s",
    "workflowTaskAttempt": 7832,
    "workflowTaskScheduledTime": "2021-02-08T15:34:19.889413600Z",
    "workflowTaskOriginalScheduledTime": "2021-02-08T15:34:19.889412300Z",
    "workflowTaskRequestId": "emptyUuid",
    "stickyScheduleToStartTimeout": "0s",
    "attempt": 1,
    "autoResetPoints": {

    },
    "versionHistories": {
      "histories": [
        {
          "branchToken": "CiQyZTMwNTY0Yy04ZjRkLTQ1ZTItYTVlZC1kMGFmMzRjMGEzMzcSJDY1ZDlhMjlmLTc3NzQtNGFiYS1hYjZjLTdiOWFjYWNiMjM0Nw==",
          "items": [
            {
              "eventId": "4"
            }
          ]
        }
      ]
    },
    "firstExecutionRunId": "2e30564c-8f4d-45e2-a5ed-d0af34c0a337",
    "executionStats": {
      "historySize": "53699"
    },
    "workflowRunExpirationTime": "0001-01-01T00:00:00Z"
  },
  "executionState": {
    "createRequestId": "db985d3e-5659-4c1c-bc68-184858dcb9e7",
    "runId": "2e30564c-8f4d-45e2-a5ed-d0af34c0a337",
    "state": "Running",
    "status": "Running"
  },
  "nextEventId": "5"
}
Current branch token:
{
  "tree_id": "2e30564c-8f4d-45e2-a5ed-d0af34c0a337",
  "branch_id": "65d9a29f-7774-4aba-ab6c-7b9acacb2347"
}
History service address: 172.22.0.3:7234
Shard Id: 1

$ docker-compose run tctl --ns default admin wf show --db_address 127.0.0.1 --db_port 9042 --tree_id 2e30564c-8f4d-45e2-a5ed-d0af34c0a337
Error: ReadHistoryBranch err
Error Details: ReadHistoryBranch. Close operation failed. Error: invalid UUID ""
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

Steps to Reproduce the Problem

  1. Execute a workflow with args that serialize to larger than the max history size (see the sketch below)
  2. Try to get any details about the "Running" execution
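
For reference, a minimal Go sketch of step 1, assuming the Go SDK and the TS.Load workflow / task queue shown in the tctl output above (the workflow ID and payload size are illustrative):

package main

import (
	"context"
	"log"
	"strings"

	"go.temporal.io/sdk/client"
)

func main() {
	// Connect to the Temporal frontend started by docker-compose.
	c, err := client.NewClient(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// An argument of roughly 55 KB, large enough to trigger the failure
	// described in this issue.
	bigArg := strings.Repeat("x", 55*1024)

	we, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:        "ts-TS.Load-repro",
		TaskQueue: "TIMESERIES_TASK_QUEUE",
	}, "TS.Load", bigArg)
	if err != nil {
		// Expected: a clear error about the oversized payload / failed write.
		log.Fatalf("ExecuteWorkflow returned: %v", err)
	}
	// Actual: the start "succeeds" and the execution is left with corrupted history.
	log.Printf("started workflow %s, run %s", we.GetID(), we.GetRunID())
}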

Specifications

shaunco commented 3 years ago

Digging deeper into this, it was not actually args larger than the max history size, but a total payload larger than Scylla's batch_size_fail_threshold (default is 50 KB):

BatchStatement - Batch of prepared statements for temporal.history_tree, temporal.history_node is of size 54566, exceeding specified FAIL threshold of 51200 by 3366.

We can work around this for now by raising Scylla's batch_size_fail_threshold, but I'm leaving this issue open, as Temporal should fail gracefully and roll back/update the execution state when it gets a DB error, not end up in a broken state that requires manual DB modifications.
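
For anyone else hitting this, a sketch of the workaround, assuming Scylla honors the same batch_size_fail_threshold_in_kb setting as Cassandra's cassandra.yaml (values are in KB) and that the official image reads its config from /etc/scylla/scylla.yaml; verify both for your Scylla version:

# scylla.yaml (e.g. mounted over /etc/scylla/scylla.yaml in the scylla container)
batch_size_warn_threshold_in_kb: 64
batch_size_fail_threshold_in_kb: 1024   # default is 50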

wxing1292 commented 3 years ago

> when it gets a DB error

Is there any error returned by the DB / logged by Temporal? Off the top of my head, if a DB error is returned, the business logic should return that error to the caller.

A data point from MySQL: previously the schema was not set to allow payloads larger than 64K, and the DB just silently truncated the data: https://github.com/temporalio/temporal/pull/1056

wxing1292 commented 3 years ago

@shaunco btw, you can manually delete those corrupted workflows with tctl admin workflow delete (see -h for usage)
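
Something like the following, using the IDs from the output above (the exact flags differ between tctl versions, so treat them as an assumption and confirm with -h first):

$ docker-compose run tctl --ns default admin workflow delete \
    --workflow_id ts-TS.Load-GTMTj7XPrsW6d8iVnohKGs6LZ \
    --run_id 2e30564c-8f4d-45e2-a5ed-d0af34c0a337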

wxing1292 commented 3 years ago

Seems that my local test setup is not working correctly; let me try using an online Cassandra.

UPDATE: seems that our test cluster setup (3 nodes) is also unable to verify this (probably due to cluster size == 3).

@shaunco I did try to repro the issue locally:

  1. confirmed the issue is there
  2. seems that Cassandra fails silently (no error logged / returned)
shaunco commented 3 years ago

Maybe it is specific to Scylla? I get it with this docker-compose.yml and a 55 KB argument.

version: "3.7"

services:
  # ScyllaDB
  scylla:
    image: scylladb/scylla:4.2.0
    container_name: scylla
    ports:
      - 0.0.0.0:9042:9042
    networks:
      - scylla-test
    volumes:
      - scylla-data:/var/lib/scylla

  # Temporal, configured to use Scylla
  temporal:
    image: temporalio/auto-setup:${SERVER_TAG:-1.6.3}
    container_name: temporal
    ports:
      - "7233:7233"
    networks:
      - scylla-test
    volumes:
      - ./config/dynamicconfig:/etc/temporal/config/dynamicconfig
    environment:
      CASSANDRA_SEEDS: "scylla"
      CASSANDRA_PORT: "9042"
      DYNAMIC_CONFIG_FILE_PATH: "config/dynamicconfig/development.yaml"
    depends_on:
      - scylla

  # Temporal web portal - http://localhost:7230
  temporal-web:
    image: temporalio/web:${WEB_TAG:-1.6.1}
    ports:
      - "7230:8088"
    networks:
      - scylla-test
    environment:
      - "TEMPORAL_GRPC_ENDPOINT=temporal:7233"
      - "TEMPORAL_PERMIT_WRITE_API=true"
    depends_on:
      - temporal

  # Temporal CLI
  tctl:
    image: temporalio/tctl:${SERVER_TAG:-1.6.3}
    networks:
      - scylla-test
    environment:
      - "TEMPORAL_CLI_ADDRESS=temporal:7233"
    depends_on:
      - temporal      

networks:
  scylla-test:

volumes:
  scylla-data:
wxing1292 commented 3 years ago

> Maybe it is specific to Scylla? I get it with this docker-compose.yml and a 55 KB argument.

Let me give it a try.

wxing1292 commented 3 years ago

When I configure the test workflow to start a child workflow with a 1 MB payload, this is what I see (using the above docker-compose):

"AppendHistoryNodes operation failed. Error: Batch too large"

Scylla is reporting:

scylla          | ERROR 2021-02-09 00:51:59,929 [shard 3] BatchStatement - Batch of prepared statements for temporal.history_tree, temporal.history_node is of size 1051978, exceeding specified FAIL threshold of 51200 by 1000778.

@shaunco can we sync on Slack? I may need a little bit more information about the "Failed to get history on workflow" / "corrupted history event batch, eventID is not continuous" errors. One more thing: we currently only officially support Cassandra 3.11 / MySQL 5.7 / PostgreSQL 9.6.

wxing1292 commented 3 years ago

Tried my local Cassandra setup as well as the default Cassandra docker image (both 3.11); it seems that batch_size_fail_threshold either carries a different meaning (Cassandra vs. Scylla) or is simply ignored.

Let's talk on Slack.

wxing1292 commented 3 years ago

Update: it seems that Cassandra and Scylla are behaving differently (needs more confirmation from both sides).

In Cassandra the batch query in question targets a single partition (which seems to be guaranteed behavior); in Scylla the same batch query targets multiple partitions (2).

^ This means batch_size_fail_threshold is not being evaluated in Cassandra in this case.

wxing1292 commented 3 years ago

Since ScyllaDB is not officially supported right now, I will close this ticket.

Please reopen this ticket if you have any comments / concerns.

jeacott1 commented 10 months ago

Any chance of adding a feature like Conductor has with transparent external payload storage? https://conductor.netflix.com/devguide/architecture/technicaldetails.html#external-payload-storage
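
In the meantime, a client-side claim-check sketch of that idea in Go (the BlobStore interface and all names below are placeholders, not a Temporal or Conductor API): store the large payload externally and pass only a small reference through ExecuteWorkflow, so history events stay small.

package payloadref

import (
	"context"

	"github.com/google/uuid"
	"go.temporal.io/sdk/client"
)

// BlobStore is a stand-in for S3 / GCS / any external payload store.
type BlobStore interface {
	Put(ctx context.Context, bucket, key string, data []byte) error
	Get(ctx context.Context, bucket, key string) ([]byte, error)
}

// PayloadRef is what actually crosses the Temporal boundary: a small,
// serializable pointer to the real data in external storage.
type PayloadRef struct {
	Bucket string
	Key    string
}

// StartWithLargePayload uploads the data externally and starts the workflow
// with just the reference; workflow/activity code fetches the blob by ref.
func StartWithLargePayload(ctx context.Context, c client.Client, store BlobStore, data []byte) (client.WorkflowRun, error) {
	ref := PayloadRef{Bucket: "workflow-payloads", Key: uuid.New().String()}
	if err := store.Put(ctx, ref.Bucket, ref.Key, data); err != nil {
		return nil, err
	}
	return c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
		ID:        "ts-TS.Load-" + ref.Key,
		TaskQueue: "TIMESERIES_TASK_QUEUE",
	}, "TS.Load", ref)
}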