trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Getting PAGE_TRANSPORT_TIMEOUT for any query that goes beyond 1 minute. #24149

Open ak2766 opened 5 days ago

ak2766 commented 5 days ago

TL;DR - Which timeout setting do I need to change from its default to stop this PAGE_TRANSPORT_TIMEOUT, and does it need to go in the coordinator config, the worker config, or both?

I'm experimenting with Trino and DBeaver and I'm hitting a roadblock with queries that take longer than 1 minute to complete.

For instance, the query below never finishes when run from the Trino CLI. However, if I run the same query in SSMS, it completes in ~3 minutes. The table is >30GB.

trino> SELECT id, REDACTED, REDACTED, REDACTED, count, REDACTED 
    -> from mssql.dbo.REDACTED 
    -> order by count desc
    -> limit 200;

Query 20241117_033406_00006_u26jx, FAILED, 1 node
Splits: 1 total, 0 done (0.00%)
1:07 [1 rows, 0B] [0 rows/s, 0B/s]

Query 20241117_033406_00006_u26jx failed: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://REDACTED:8080/v1/task/20241117_033406_00006_u26jx.0.0.0/results/0/0 - 104 failures, failure duration 60.02s, total failed request time 61.74s)

Initially, I thought it was a DBeaver issue and wasted hours researching timeouts. After exhausting all the timeout settings in DBeaver, I finally tried running the query directly in Trino's CLI (which I really ought to have done first) and discovered that the timeout is occurring somewhere inside Trino. I just started on Trino yesterday, so I'm very green.

Here are my configs:

coordinator config:
```
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://trino-coordinator:8080
internal-communication.shared-secret=REDACTED
internal-communication.https.required=false
```

worker config:
```
coordinator=false
http-server.http.port=8080
discovery.uri=http://trino-coordinator:8080
internal-communication.shared-secret=REDACTED
internal-communication.https.required=false
```

Quick EDIT: If I comment out the order by count desc, the query completes in under 5 seconds.

zachtrong commented 3 days ago

There are undocumented http-client config properties that increase the timeout:

  workerExtraConfig: |-
    exchange.http-client.request-timeout=60s
    exchange.http-client.idle-timeout=2m 
    exchange.http-client.max-connections-per-server=1000
  coordinatorExtraConfig: |-
    exchange.http-client.request-timeout=60s
    exchange.http-client.idle-timeout=2m 
    exchange.http-client.max-connections-per-server=1000
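
For a plain (non-Helm) deployment like the one in the original post, a minimal sketch would be to put the same three properties in `etc/config.properties` on both the coordinator and every worker, then restart the nodes. This assumes the keys are accepted there exactly as written in the snippet above; the values are copied from that snippet and are not tuned recommendations.

```properties
# etc/config.properties (assumed location; repeat on the coordinator and on each worker)
# Values copied from the snippet above, not tuned for this workload.
exchange.http-client.request-timeout=60s
exchange.http-client.idle-timeout=2m
exchange.http-client.max-connections-per-server=1000
```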

ak2766 commented 3 days ago

Thanks @zachtrong.

I went through the logs and searched for timeouts. Trial and error led me to the correct setting, and I got the query working a day ago but forgot to come back and update this issue.