Closed by @okayhooni 1 month ago
@okayhooni would it be possible for you to provide JVM crash dump if this was due to JVM crash ? Also, which version of JVM are you using ? If there isn't a JVM crash, but just high CPU followed by pod crash, then a JFR profile would help to see which code path specifically is a problem.
Thank you for the quick answer..! We use Temurin JDK 22 (Temurin-22.0.1+8).
Currently, we have not mounted any persistent volume to the worker pods for keeping JVM crash artifacts such as thread dumps; the only exception is the heap dump file written by the HeapDumpOnOutOfMemoryError JVM option. But this issue is not related to OOM, and there is no JVM heap dump.
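For future incidents, the crash artifacts could be redirected to a mounted volume with standard HotSpot flags (a sketch; the /dumps mount path is an assumption, not our actual pod layout):

```properties
# jvm.config fragment (illustrative; /dumps would be a mounted volume)
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/dumps/heap.hprof
# hs_err files are what you'd want for an actual JVM crash; %p expands to the PID
-XX:ErrorFile=/dumps/hs_err_pid%p.log
```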
I suspect this issue isn't caused by a JVM crash but rather a CONTAINER crash. I've observed logs like the ones below after invoking the graceful shutdown API in the preStop container lifecycle hook of our Trino worker pod, before the container restarts.
INFO http-worker-2253 io.trino.server.GracefulShutdownHandler Shutdown requested
...
INFO shutdown-handler-0 io.trino.server.GracefulShutdownHandler Waiting for all tasks to finish
...
WARN shutdown-handler-0 io.trino.server.GracefulShutdownHandler Timed out waiting for the life cycle to stop
...
INFO LifeCycleManager Shutdown Hook io.airlift.bootstrap.LifeCycleManager JVM is shutting down, cleaning up
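For context, the graceful shutdown API invoked from the preStop hook looks roughly like the following (a sketch, not our exact manifest; the port, user header value, and shell form are assumptions):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        # Ask the worker to drain its tasks before the container is stopped.
        - |
          curl -s -X PUT \
            -H "X-Trino-User: admin" \
            -H "Content-Type: application/json" \
            -d '"SHUTTING_DOWN"' \
            http://localhost:8080/v1/info/state
```

Note that terminationGracePeriodSeconds must be long enough to cover shutdown.grace-period plus task draining, otherwise Kubernetes will SIGKILL the container mid-drain, which produces exactly the "Timed out waiting for the life cycle to stop" pattern above.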
I am afraid I can't promptly share a JFR profile captured with the StartFlightRecording=event=cpu-load JVM option, due to some other work with higher priority.. but I will!
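For anyone else reproducing this, a recording can also be started on a running worker without a restart via jcmd (a sketch; the PID, recording name, and output path are placeholders):

```shell
# Start a 2-minute recording with the built-in "profile" settings
jcmd <trino-pid> JFR.start name=cpu-surge settings=profile duration=120s filename=/tmp/cpu-surge.jfr

# Or enable it at JVM startup via jvm.config:
# -XX:StartFlightRecording=settings=profile,duration=120s,filename=/tmp/cpu-surge.jfr
```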
I am afraid this issue might not be related to Parquet vectorized decoding, as it is not consistently reproducible in a deterministic way, even under the same query-replay conditions. Sometimes this issue does NOT occur in the staging environment with only Graviton nodes, for the same replay conditions. However, I am unable to reproduce this issue when testing on x86 nodes, or with the configuration parquet.experimental.vectorized-decoding=false.
Sorry... I found this issue was reproduced even with parquet.experimental.vectorized-decoding=false on a pod on Graviton nodes.. It may be caused by some other reason..
@raunaqmorarka
I found this issue was just another side effect of the thread-per-driver scheduler, already reported in issue 21512. When I disabled experimental.thread-per-driver-scheduler-enabled, the issue went away..!
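For anyone hitting the same symptom, the workaround is a one-line change in the Trino config (a fragment; the rest of the file is omitted):

```properties
# etc/config.properties fragment
experimental.thread-per-driver-scheduler-enabled=false
```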
Context
Recently, we upgraded our Trino cluster from v433 to v451 and encountered unexpected restarts of some Trino worker containers (within pods on our EKS cluster), accompanied by CPU surge metrics during heavy-load situations, typically on weekday mornings when numerous queries are submitted by various regular batches and by Trino users across our company.

I investigated the logs and metrics of these problematic pods but found no explicit clues to pinpoint the issue. Prior to termination, the worker logs were predominantly filled with entries from the Hive connector split runner, indicating read operations on Parquet files, as shown below:
While a CPU surge prior to the container crash was suspected, Kubernetes and the JVM do not, as you know, kill or restart containers based on CPU throttling, unlike memory usage. Hence it seems unusual that containers would restart despite sufficient memory availability.
However, I noticed that vectorized decoding in the Parquet reader has been introduced and enabled by default since v448, and I suspect this feature might be the cause of the issue. Enabling this option (which is the same as the default setting) and replaying queries from our Trino audit logs from the incident window successfully reproduced the issue on Graviton2/3 (arm64) nodes. To investigate further, I disabled this feature using parquet.experimental.vectorized-decoding=false with a query-replay tool under the same conditions, which resolved the issue. Additionally, I found that the issue did not occur when using only Intel (amd64) nodes (r7i, r6i) for worker pods, even when this feature was enabled.

According to the PR description above, this feature is auto-enabled only on supported platforms: "Vectorized decoding is used only when the preferred vector bit size for the current platform is at least 256 bits (enabled on x86 and Graviton 3 machines but not on Graviton 2 machines)." Despite this, I experienced unexpected worker container crashes on both r6g (Graviton 2) and r7g (Graviton 3) when this feature was enabled by default.
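The 256-bit gate mentioned above can be checked directly on a given node with the incubating Vector API (a diagnostic sketch; assumes a JDK with the jdk.incubator.vector module and jshell on the PATH):

```shell
# Print the preferred vector species bit size for this JVM/CPU.
# Graviton 2 (128-bit NEON) is expected to report 128, while AVX2
# x86 machines report 256 (and AVX-512 machines 512).
echo 'System.out.println(jdk.incubator.vector.IntVector.SPECIES_PREFERRED.vectorBitSize());' \
  | jshell --add-modules jdk.incubator.vector -
```

A value below 256 on Graviton 2 would be consistent with the PR's claim that vectorized decoding should not have been active there.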
Our Trino deployment environment (with the worker container crash issue):
- Worker node types: Graviton 3 (r7g, m7g), Graviton 2 (r6g, m6g), Intel (r7i, r6i, m7i, m6i)
- Trino version: v451, with some customized code for our use cases

Suggestion
As I am not an expert in hardware and CPUs, further investigation into this issue is challenging for me. However, it appears that this feature may not yet be stable (it uses the Java Vector API, which is still in incubation). I propose temporarily changing the default value of parquet.experimental.vectorized-decoding.enabled to false until the feature stabilizes enough to remove the experimental label. This change would help prevent other Trino users from encountering issues similar to mine.

Related PRs
Related Issues
cc/ @raunaqmorarka