trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Trino worker pods restarting on Kubernetes #24219

Open sarthak-autodesk opened 16 hours ago

sarthak-autodesk commented 16 hours ago

I am running Trino on Kubernetes, and under more intensive workloads the worker pods keep restarting. The pods have sufficient memory, and I have verified that they are not being OOM-killed.
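
For anyone wanting to repeat the OOM check, one place to look is the container's last termination state in the pod status. The following is a minimal sketch, assuming the pod has been dumped to a dict (for example from `kubectl get pod <pod> -o json`). The function names and the sample data are hypothetical and are not taken from this cluster.

```python
def last_terminations(pod: dict) -> list:
    """Return (container, reason, exit_code) for every container
    that has a recorded previous termination."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            results.append((
                status.get("name", ""),
                terminated.get("reason", ""),
                terminated.get("exitCode", -1),
            ))
    return results


def was_oom_killed(pod: dict) -> bool:
    """True if any container's last termination reason is OOMKilled."""
    return any(reason == "OOMKilled" for _, reason, _ in last_terminations(pod))


# Hypothetical sample data, for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "trino-worker",
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(last_terminations(sample_pod))
print(was_oom_killed(sample_pod))
```

A reason other than `OOMKilled` only rules out a kill recorded against the container's memory limit. It says nothing about what else may have sent the process a kill signal.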

Pod events:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  46m                  kubelet  Liveness probe failed: Get "http://10.189.49.197:8080/v1/info": read tcp 10.189.48.39:42204->10.189.49.197:8080: read: connection reset by peer
  Warning  Unhealthy  46m                  kubelet  Readiness probe failed: curl: (56) Recv failure: Connection reset by peer
Server is not responding to requests
  Normal   Pulled     46m                  kubelet  Successfully pulled image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest" in 113ms (113ms including waiting). Image size: 1259653891 bytes.
  Normal   Pulled     23m                  kubelet  Successfully pulled image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest" in 104ms (104ms including waiting). Image size: 1259653891 bytes.
  Warning  Unhealthy  4m4s (x3 over 117m)  kubelet  Liveness probe failed: Get "http://10.189.49.197:8080/v1/info": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  4m2s (x3 over 117m)  kubelet  Readiness probe failed: command "/usr/lib/trino/bin/health-check" timed out
  Normal   Pulling    4m1s (x5 over 15h)   kubelet  Pulling image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest"
  Normal   Created    4m1s (x5 over 15h)   kubelet  Created container trino-worker
  Normal   Started    4m1s (x5 over 15h)   kubelet  Started container trino-worker
  Normal   Pulled     4m1s                 kubelet  Successfully pulled image "204522078340.dkr.ecr.us-east-1.amazonaws.com/adp_presto:base-presto-latest" in 135ms (135ms including waiting). Image size: 1259653891 bytes.
  Warning  Unhealthy  3m49s (x3 over 46m)  kubelet  Readiness probe failed: Server is starting

I tried increasing the timeout for the liveness and readiness probes, but that didn't help.
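
As a rough illustration of how the probe settings interact, the sketch below estimates how long a container can stay unresponsive before the kubelet restarts it. It assumes the standard Kubernetes probe fields `periodSeconds`, `timeoutSeconds`, and `failureThreshold`, whose defaults are 10, 1, and 3. The function name and the non-default values are hypothetical and are not the configuration used in this deployment.

```python
def restart_budget_seconds(period_seconds: int = 10,
                           timeout_seconds: int = 1,
                           failure_threshold: int = 3) -> int:
    """Rough upper estimate of how long a container can be unresponsive
    before a liveness probe triggers a restart.

    Assumes timeout_seconds <= period_seconds, so probes do not overlap:
    the kubelet needs failure_threshold consecutive failures, one per
    period, and the last one only fails once its timeout has elapsed.
    """
    return failure_threshold * period_seconds + timeout_seconds


# Kubernetes defaults.
print(restart_budget_seconds())

# Raising only the timeout (illustrative value).
print(restart_budget_seconds(timeout_seconds=10))

# Raising period and failure threshold as well (illustrative values).
print(restart_budget_seconds(period_seconds=30, timeout_seconds=10, failure_threshold=6))
```

By this rough estimate, raising the timeout alone extends the window only slightly. The period and failure threshold have a much larger effect on how long a busy worker is tolerated before a restart.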

Logs from the pod just before it crashed:

2024-11-21T19:53:53.169Z    WARN    http-client-node-manager-490    io.trino.metadata.RemoteNodeState   Error fetching node state from http://10.189.54.58:8080/v1/info/state: Server refused connection: http://10.189.54.58:8080/v1/info/state
2024-11-21T19:53:53.171Z    WARN    http-client-node-manager-594    io.trino.metadata.RemoteNodeState   Error fetching node state from http://10.189.54.44:8080/v1/info/state: Server refused connection: http://10.189.54.44:8080/v1/info/state
2024-11-21T19:53:53.375Z    WARN    http-client-node-manager-30 io.trino.metadata.RemoteNodeState   Error fetching node state from http://10.189.53.41:8080/v1/info/state: Failed communicating with server: http://10.189.53.41:8080/v1/info/state
2024-11-21T19:53:53.392Z    ERROR   page-buffer-client-callback-6   io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.4.62.0/results/89 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.4.62.0/results/89
2024-11-21T19:53:53.393Z    ERROR   page-buffer-client-callback-21  io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.5.55.0/results/89 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01112_qk674.5.55.0/results/89
2024-11-21T19:53:53.394Z    ERROR   page-buffer-client-callback-2   io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.4.75.0/results/45 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.4.75.0/results/45
2024-11-21T19:53:53.396Z    ERROR   page-buffer-client-callback-5   io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.5.55.0/results/45 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01113_qk674.5.55.0/results/45
2024-11-21T19:53:53.396Z    ERROR   page-buffer-client-callback-1   io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.4.22.0/results/56 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.4.22.0/results/56
2024-11-21T19:53:53.397Z    ERROR   page-buffer-client-callback-4   io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.5.53.0/results/56 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01116_qk674.5.53.0/results/56
2024-11-21T19:53:53.400Z    ERROR   page-buffer-client-callback-11  io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.5.54.0/results/48 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.5.54.0/results/48
2024-11-21T19:53:53.400Z    ERROR   page-buffer-client-callback-9   io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.4.7.0/results/48 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01115_qk674.4.7.0/results/48
2024-11-21T19:53:53.401Z    ERROR   page-buffer-client-callback-14  io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.4.10.0/results/27 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.4.10.0/results/27
2024-11-21T19:53:53.401Z    ERROR   page-buffer-client-callback-13  io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.5.54.0/results/27 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195338_01117_qk674.5.54.0/results/27
2024-11-21T19:53:53.401Z    ERROR   page-buffer-client-callback-12  io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.3.51.0/results/19 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.3.51.0/results/19
2024-11-21T19:53:53.401Z    ERROR   page-buffer-client-callback-18  io.trino.operator.HttpPageBufferClient  Request to delete http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.2.26.0/results/19 failed java.io.UncheckedIOException: Failed communicating with server: http://10.189.53.41:8080/v1/task/20241121_195337_01114_qk674.2.26.0/results/19
/entrypoint/entrypoint.sh: line 44:    23 Killed                  /usr/lib/trino/bin/run-trino