trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.42k stars 3k forks source link

Iceberg commit retries #9582

Open findepi opened 3 years ago

lvoinea commented 1 year ago

Is this related to https://iceberg.apache.org/spec/#commit-conflict-resolution-and-retry? Any indication about when this will be implemented?

Milias commented 2 months ago

Hello,

I cannot find any other issue or anything else when searching for "trino retry iceberg commit error" besides this one, so I hope it makes sense!

I have configured a Trino deployment on Kubernetes with task retries to try to speed up concurrent writes to an iceberg table from multiple queries executed at once. I see that the configuration is correct because there are files being created in the location specified for the exchange manager. I have also set the following configuration in the helm chart:

additionalConfigProperties:
- retry-policy=TASK
- retry-attempts-per-task=20
- tolerant-execution-standard-split-size=128MB

The exchange manager is also configured separately.

What I see however is that when two queries are executed at once, error EXTERNAL ERROR — ICEBERG_COMMIT_ERROR is raised and the query stops completely.

From the documentation on fault-tolerant execution I would have expected that only the failing task is re-executed instead of the whole query showing as failure. This is a long-ish query, takes ~1 minute to execute, while the task that fails is only at the end, taking ~500ms. I was hoping that by setting up task retries Trino would instead re-execute only this last task instead of the whole query.

Is this a wrong assumption? Could you please help me understand better what is the expected behavior?