Closed: shaunco closed this issue 3 years ago.
Digging deeper into this, it was not actually args larger than the max history size, but a total payload larger than Scylla's batch_size_fail_threshold (default is 50 KB):
BatchStatement - Batch of prepared statements for temporal.history_tree, temporal.history_node is of size 54566, exceeding specified FAIL threshold of 51200 by 3366.
We can work around this for now by raising Scylla's batch_size_fail_threshold, but I am leaving this issue open because Temporal should fail gracefully and roll back/update the execution state when it gets a DB error, not end up in a broken state that requires manual DB modifications.
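For reference, the workaround lives in scylla.yaml (Scylla reuses Cassandra's option names; values are in KiB, with the fail threshold defaulting to 50). A sketch of the change, assuming the stock option names; the exact values to pick depend on your largest expected history batch:

```yaml
# scylla.yaml (or cassandra.yaml) - raise the batch limits so a ~55 KB
# history batch no longer trips the FAIL threshold. Values are in KiB.
batch_size_warn_threshold_in_kb: 64
batch_size_fail_threshold_in_kb: 128
```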
when it gets a DB error
Is there any error returned by the DB / logged by Temporal? Off the top of my head, if a DB error is returned, the business logic should return that error to the caller.
A data point from MySQL: previously the schema was not set to allow payloads of more than 64 KB, and the DB would just silently truncate data: https://github.com/temporalio/temporal/pull/1056
@shaunco btw, you can manually delete those corrupted workflows with tctl admin workflow delete (see -h for usage).
Seems that my local test setup is not working correctly; let me try using an online Cassandra.
UPDATE: it seems our test cluster setup (3 nodes) is also unable to verify this (probably due to cluster size == 3).
@shaunco I did try to repro the issue locally:
Maybe it is specific to Scylla? I get it with this docker-compose.yml and a 55 KB argument.
```yaml
version: "3.7"

services:
  # ScyllaDB
  scylla:
    image: scylladb/scylla:4.2.0
    container_name: scylla
    ports:
      - "0.0.0.0:9042:9042"
    networks:
      - scylla-test
    volumes:
      - scylla-data:/var/lib/scylla

  # Temporal, configured to use Scylla
  temporal:
    image: temporalio/auto-setup:${SERVER_TAG:-1.6.3}
    container_name: temporal
    ports:
      - "7233:7233"
    networks:
      - scylla-test
    volumes:
      - ./config/dynamicconfig:/etc/temporal/config/dynamicconfig
    environment:
      CASSANDRA_SEEDS: "scylla"
      CASSANDRA_PORT: "9042"
      DYNAMIC_CONFIG_FILE_PATH: "config/dynamicconfig/development.yaml"
    depends_on:
      - scylla

  # Temporal web portal - http://localhost:7230
  temporal-web:
    image: temporalio/web:${WEB_TAG:-1.6.1}
    ports:
      - "7230:8088"
    networks:
      - scylla-test
    environment:
      - "TEMPORAL_GRPC_ENDPOINT=temporal:7233"
      - "TEMPORAL_PERMIT_WRITE_API=true"
    depends_on:
      - temporal

  # Temporal CLI
  tctl:
    image: temporalio/tctl:${SERVER_TAG:-1.6.3}
    networks:
      - scylla-test
    environment:
      - "TEMPORAL_CLI_ADDRESS=temporal:7233"
    depends_on:
      - temporal

networks:
  scylla-test:

volumes:
  scylla-data:
```
Let me give it a try.
When I configure the test workflow to start a child workflow with a 1 MB payload, this is what I see (using the above docker-compose):
"AppendHistoryNodes operation failed. Error: Batch too large"
Scylla is reporting:
scylla | ERROR 2021-02-09 00:51:59,929 [shard 3] BatchStatement - Batch of prepared statements for temporal.history_tree, temporal.history_node is of size 1051978, exceeding specified FAIL threshold of 51200 by 1000778.
@shaunco can we sync on Slack? I may need a little bit more information ("Failed to get history on workflow" or "corrupted history event batch, eventID is not continuous").
One more thing: we currently only officially support Cassandra 3.11 / MySQL 5.7 / PostgreSQL 9.6.
Tried my local Cassandra setup as well as the default Cassandra Docker image (both 3.11); it seems that batch_size_fail_threshold either carries a different meaning (Cassandra vs Scylla) or is simply ignored.
Let us talk on Slack.
UPDATE: it seems that Cassandra and Scylla behave differently here (this needs more confirmation from both sides). In Cassandra, the batch query in question targets a single partition (which appears to be guaranteed behavior); in Scylla, the same batch query targets multiple partitions (2). This means batch_size_fail_threshold is not being evaluated in Cassandra in this case.
Since Scylla is not officially supported right now, I will close this ticket.
Please reopen it if you have any comments / concerns.
Any chance of adding a feature like Conductor's transparent external payload storage? https://conductor.netflix.com/devguide/architecture/technicaldetails.html#external-payload-storage
Expected Behavior

If ExecuteWorkflow is called with args that serialize to larger than the max history size, a proper error should be returned from ExecuteWorkflow and the workflow history should reflect that error.

Actual Behavior
A workflow execution is created with no history. Subsequent attempts to retrieve the workflow via tctl or temporal-web get "Failed to get history on workflow" or "corrupted history event batch, eventID is not continuous", and it appears that subsequent workflow executions in the same namespace (even for different workflows) get stuck behind the now-corrupt workflow.

We haven't yet found a way to manually clear this execution with no history through Temporal-provided tools and end up having to manually clear it from the database.
Some tctl logs:

Steps to Reproduce the Problem
Specifications