Closed (zy2xu closed this issue 1 week ago)
This issue should be fixed as of Trino 423 by the fix in https://github.com/trinodb/trino/pull/18175. However, there is reportedly still a similar potential issue with stuck stages/tasks that seems to be specific to fault-tolerant execution.
@pettyjamesm Can this be considered resolved? Or do you mean there's an FTE-specific issue remaining?
This one can be closed. There were discussions about a similar issue in Slack at the time, but I've lost track of whether there was a follow-up fix or whether that issue still exists.
For FTE it seems it was addressed in https://github.com/trinodb/trino/pull/20021
We have a large Trino cluster (more than 100 nodes) running in a production system. Occasionally, nodes become overloaded or fail during query execution. When this happens, we see the following error messages from the coordinator:
Normally, these threads are cleaned up once the query fails completely due to the outage of a worker node. Occasionally, though, some of these threads are not cleaned up, and they keep spamming the production logs every second.
Our current workaround is to restart the coordinator so that these lingering threads get cleaned up.
Is this a known problem, or has anyone else seen this?
The Trino version we are using is 413.
Below is a copy of our config values:
Would appreciate any comments or help!