davseitsev opened this issue 2 months ago
cc: @martint @findepi
@davseitsev appreciate detailed analysis, that's very helpful!
Thanks a lot for sharing your findings and a potential solution for the problem. We had a similar issue on our Trino cluster that started after we upgraded Trino from version 426 to 442. Workers started having exchange issues more frequently, with logs like:
```
2024-04-25T10:24:55.409Z WARN node-state-poller-0 io.trino.metadata.RemoteNodeState Node state update request to https://xxx.ip:8081/v1/info/state has not returned in 10.02s
2024-04-25T10:24:55.410Z WARN node-state-poller-0 io.trino.metadata.RemoteNodeState Node state update request to https://xxx.ip:8081/v1/info/state has not returned in 10.02s
2024-04-25T10:24:58.184Z INFO Thread-47 io.airlift.bootstrap.LifeCycleManager JVM is shutting down, cleaning up
2024-04-25T10:24:58.185Z INFO Thread-42 io.airlift.bootstrap.LifeCycleManager JVM is shutting down, cleaning up
```
I see that `experimental.thread-per-driver-scheduler-enabled` has been enabled since version 438 (https://github.com/trinodb/trino/pull/20451), so we decided to disable it, which helped stabilize the cluster. We still see some exchange issues, but before disabling the feature the number of impacted queries was about 14x higher.
Chiming in with more feedback - we're seeing exactly the same symptoms here after an upgrade from 436 to 444. Nodes are going down, queries are failing left and right, and the cluster is altogether very unstable.
If it helps at all, we're also seeing a ~30% query slowdown on smaller benchmarks before nodes go unresponsive. Running the benchmark for longer (not very long in the grand scheme of things, maybe 5 to 10 minutes at ~half our current prod concurrency) causes unresponsive nodes.
I can also confirm that setting `experimental.thread-per-driver-scheduler-enabled=false` fixes it completely: performance is restored to Trino 436 levels and the nodes are stable.
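For anyone else hitting this, disabling the scheduler is a one-line change on each node. This is a sketch of the relevant fragment of the node's `config.properties` (the property name is taken from this thread; the exact config file location depends on your deployment, and a restart is required):

```properties
# Revert to the legacy scheduler discussed in this issue.
# Requires restarting the node for the change to take effect.
experimental.thread-per-driver-scheduler-enabled=false
```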
Are you able to narrow it down to specific query shapes? It would help immensely to debug if we can get some examples that cause the cluster to fall over, or that perform more slowly.
@martint let me dig into our benchmark and see if I can narrow it down to specific shapes 👍
@martint well it's been a bit of a wild goose chase. I have a list of 100 queries that do cause the issue, but I've been unable to pinpoint the issue to any single query in particular. Happy to share this list in private if it helps.
All these queries:
@davseitsev are there any commonalities with your workloads? 🤔
I can't imagine this issue affects every Trino workload, or there would have been much more noise about it after the 436 release.
Do the queries have joins, union, many stages? Does it happen if you load the cluster with any specific type of query, or only when you mix them up?
> Do the queries have joins, union, many stages?
Yes on all three counts 😅
> Does it happen if you load the cluster with any specific type of query, or only when you mix them up?
It appears to be the mixing, or at least some level of concurrency, as I cannot reproduce this by executing queries one-by-one sequentially.
Our clusters can sustain ~40 concurrent queries on average, and the problem appears with ~10 concurrent queries, so it's not like we're pushing the envelope in terms of hardware either.
After upgrading Trino from version 409 to 444, we started to experience issues with stuck workers. They refuse HTTP requests, or the requests hang forever, and the workers disappear from the discovery service.
The logs are full of messages like:
Where `10.42.110.232` is the IP of the problematic worker.

The number of open file descriptors on the problematic worker increases dramatically:

![image](https://github.com/trinodb/trino/assets/1793410/666a0ee8-d78f-4060-b779-78c6b1fba6ce)
Sometimes we reach the open-socket limit and get a lot of exceptions in the logs like this, although it doesn't happen every time. On Trino 409 we had a limit of 128K open file descriptors on the workers. When we hit this issue we increased the limit to 256K, but it didn't help: the limit is exceeded very quickly.
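While reproducing, the descriptor growth can be watched directly from `/proc`. This is a sketch assuming a Linux worker and that `pgrep` can find the Trino server process by its standard main class (`io.trino.server.TrinoServer` is an assumption; adjust the pattern for your launcher):

```shell
#!/bin/sh
# Count open file descriptors for the Trino server process (Linux only).
pid=$(pgrep -f io.trino.server.TrinoServer | head -n1)
if [ -n "$pid" ]; then
    # Each entry in /proc/<pid>/fd is one open descriptor
    ls "/proc/$pid/fd" | wc -l
else
    echo "trino server process not found" >&2
fi
```

Running this in a loop (e.g. under `watch`) makes it easy to correlate the descriptor climb with the query mix that triggers it.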
The number of threads also starts to increase:

![image](https://github.com/trinodb/trino/assets/1793410/20265861-a805-45e2-a9d4-937273fb5fa7)
The thread dump shows that a single random thread blocks many other threads. Example 1:

![image](https://github.com/trinodb/trino/assets/1793410/1be3592a-b150-471d-af8d-da43f2d8bc32)

Blocking thread stack trace (full thread dump: trino-worker-D0409-T0940-25683.tdump.txt):
Example 2:

![image](https://github.com/trinodb/trino/assets/1793410/61d8ac6e-9691-46c5-a9c1-2c8fd406e46b)

Blocking thread stack trace (full thread dump: trino-worker-D0409-T0949-25683.tdump.txt):
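If others want to capture comparable dumps, the JDK's own tooling is enough. A sketch, assuming `jcmd` from the same JDK is on the PATH and the PID lookup by the standard Trino main class works for your launcher:

```shell
#!/bin/sh
# Capture a timestamped full thread dump from the running worker JVM.
pid=$(pgrep -f io.trino.server.TrinoServer | head -n1)
jcmd "$pid" Thread.print > "trino-worker-$(date +%Y%m%dT%H%M%S).tdump.txt"
```

Taking two or three dumps a few seconds apart makes it easier to tell a persistently blocking thread from a transient one.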
We have changed the value of `experimental.thread-per-driver-scheduler-enabled` to `false` (https://github.com/trinodb/trino/blob/b7a161a422850c35b992f3ba26ac1a2f8bc9eb54/core/trino-main/src/main/java/io/trino/execution/TaskManagerConfig.java#L110-L115) and the issue no longer reproduces; our cluster looks stable. I will update this if anything changes. Let me know if I can provide more information.