I am running Trino on Kubernetes, and under heavier workloads the worker pods keep restarting. The workers have sufficient memory, and I have verified that the pods are not being OOM-killed.
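For reference, this is roughly how I checked for OOM kills (the pod name and namespace below are placeholders for my actual ones):

# Inspect the last terminated state of the worker container; an OOM kill
# would show Reason: OOMKilled with exit code 137.
kubectl -n trino describe pod trino-worker-0 | grep -A 5 "Last State"

# Same check via jsonpath, reading the container status directly.
kubectl -n trino get pod trino-worker-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'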
Pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 46m kubelet Liveness probe failed: Get "http://10.189.49.197:8080/v1/info": read tcp 10.189.48.39:42204->10.189.49.197:8080: read: connection reset by peer
Warning Unhealthy 46m kubelet Readiness probe failed: curl: (56) Recv failure: Connection reset by peer
Server is not responding to requests
Normal Pulled 46m kubelet Successfully pulled image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest" in 113ms (113ms including waiting). Image size: 1259653891 bytes.
Normal Pulled 23m kubelet Successfully pulled image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest" in 104ms (104ms including waiting). Image size: 1259653891 bytes.
Warning Unhealthy 4m4s (x3 over 117m) kubelet Liveness probe failed: Get "http://10.189.49.197:8080/v1/info": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m2s (x3 over 117m) kubelet Readiness probe failed: command "/usr/lib/trino/bin/health-check" timed out
Normal Pulling 4m1s (x5 over 15h) kubelet Pulling image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest"
Normal Created 4m1s (x5 over 15h) kubelet Created container trino-worker
Normal Started 4m1s (x5 over 15h) kubelet Started container trino-worker
Normal Pulled 4m1s kubelet Successfully pulled image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest" in 135ms (135ms including waiting). Image size: 1259653891 bytes.
Warning Unhealthy 3m49s (x3 over 46m) kubelet Readiness probe failed: Server is starting
I tried increasing the timeouts for the liveness and readiness probes, but that didn't help.
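For context, the probe configuration on the worker container now looks roughly like this (the timeout and threshold values below are illustrative, not my exact settings; the probe endpoints match the events above):

# Container spec fragment for the trino-worker container (example values).
livenessProbe:
  httpGet:
    path: /v1/info
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 10      # raised from the 1s default
  failureThreshold: 6
readinessProbe:
  exec:
    command: ["/usr/lib/trino/bin/health-check"]
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 10      # raised from the 1s default
  failureThreshold: 6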
Logs of the pod just before crashing:
2024-11-21T19:53:53.169Z WARN http-client-node-manager-490 io.trino.metadata.RemoteNodeState Error fetching node state from http://10.189.54.58:8080/v1/info/state: Server refused connection: http://10.189.54.58:8080/v1/info/state
2024-11-21T19:53:53.171Z WARN http-client-node-manager-594 io.trino.metadata.RemoteNodeState Error fetching node state from http://10.189.54.44:8080/v1/info/state: Server refused connection: http://10.189.54.44:8080/v1/info/state
2024-11-21T19:53:53.375Z WARN http-client-node-manager-30 io.trino.metadata.RemoteNodeState Error fetching node state from http://10.189.53.41:8080/v1/info/state: Failed communicating with server: http://10.189.53.41:8080/v1/info/state
2024-11-21T19:53:53.392Z ERROR page-buffer-client-callback-6 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.4.62.0/results/89 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.4.62.0/results/89
2024-11-21T19:53:53.393Z ERROR page-buffer-client-callback-21 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.5.55.0/results/89 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.5.55.0/results/89
2024-11-21T19:53:53.394Z ERROR page-buffer-client-callback-2 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.4.75.0/results/45 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.4.75.0/results/45
2024-11-21T19:53:53.396Z ERROR page-buffer-client-callback-5 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.5.55.0/results/45 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.5.55.0/results/45
2024-11-21T19:53:53.396Z ERROR page-buffer-client-callback-1 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.4.22.0/results/56 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.4.22.0/results/56
2024-11-21T19:53:53.397Z ERROR page-buffer-client-callback-4 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.5.53.0/results/56 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.5.53.0/results/56
2024-11-21T19:53:53.400Z ERROR page-buffer-client-callback-11 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.5.54.0/results/48 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.5.54.0/results/48
2024-11-21T19:53:53.400Z ERROR page-buffer-client-callback-9 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.4.7.0/results/48 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.4.7.0/results/48
2024-11-21T19:53:53.401Z ERROR page-buffer-client-callback-14 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.4.10.0/results/27 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.4.10.0/results/27
2024-11-21T19:53:53.401Z ERROR page-buffer-client-callback-13 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.5.54.0/results/27 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.5.54.0/results/27
2024-11-21T19:53:53.401Z ERROR page-buffer-client-callback-12 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.3.51.0/results/19 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.3.51.0/results/19
2024-11-21T19:53:53.401Z ERROR page-buffer-client-callback-18 io.trino.operator.HttpPageBufferClient Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.2.26.0/results/19 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.2.26.0/results/19
/entrypoint/entrypoint.sh: line 44: 23 Killed /usr/lib/trino/bin/run-trino