trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Enabling parquet.experimental.vectorized-decoding may induce unexpected worker container crashes/restarts on AWS Graviton2/3 instances #22727

Closed · okayhooni closed this issue 1 month ago

okayhooni commented 1 month ago

Context

Recently, we upgraded our Trino cluster from v433 to v451 and encountered unexpected restarts of some Trino worker containers (within pods on our EKS cluster), accompanied by CPU surge metrics under heavy load, typically on weekday mornings when numerous queries are submitted by regular batch jobs and by Trino users across our company.

I investigated the logs and metrics of these problematic pods but found no explicit clues to pinpoint the issue. Prior to termination, the worker logs were predominantly filled with entries from the Hive connector split runner, indicating read operations for Parquet files, as shown below:

...
SplitRunner-20240715_012419_00526_x5rv3.3.14.0-49-46004 org.apache.parquet.filter2.compat.FilterCompat  Filtering using predicate: and(userdefinedbyinstance(is_test, DomainUserDefinedPredicate{columnDescriptor=[is_test] optional int32 is_test (INTEGER(8,true)), columnDomain=[ SortedRangeSet[type=tinyint, ranges=1, {[0]}] ]}), userdefinedbyinstance(is_closed, DomainUserDefinedPredicate{columnDescriptor=[is_closed] optional int32 is_closed (INTEGER(8,true)), columnDomain=[ SortedRangeSet[type=tinyint, ranges=1, {[1]}] ]}))
...

Although a CPU surge prior to the container crash was suspected, Kubernetes does not kill or restart containers based on CPU usage; exceeding a CPU limit only results in throttling, unlike memory, where exceeding the limit triggers an OOM kill. Hence, it seems unusual that containers would restart despite sufficient memory availability.

However, I noticed that vectorized decoding in the Parquet reader was introduced and enabled by default in v448, and I suspect this feature might be the cause. With this option enabled (i.e., the default setting), replaying queries from the Trino audit logs of the incident reproduced the issue on the Graviton2/3 (arm64) nodes. To investigate further, I disabled the feature with parquet.experimental.vectorized-decoding=false and re-ran the same replay with the query-replay tool, which resolved the issue. Additionally, the issue did not occur when using only Intel (amd64) nodes (r7i, r6i) for the worker pods, even with this feature enabled.
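For anyone who wants to try the same thing, this is roughly how we applied the setting. This is only a sketch of our setup: the catalog file name below is an assumption based on our deployment, and the property name is written exactly as quoted in this issue.

# etc/catalog/hive.properties  (file name/path assumed from our Hive connector setup)
# Disable the experimental vectorized Parquet decoding introduced in v448
parquet.experimental.vectorized-decoding=false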

According to the PR description above, this feature is auto-enabled only on supported platforms: "Vectorized decoding is used only when the preferred vector bit size for the current platform is at least 256 bits (enabled on x86 and Graviton 3 machines but not on Graviton 2 machines)." Despite this, I experienced unexpected worker container crashes on both r6g (Graviton 2) and r7g (Graviton 3) instances when this feature was enabled by default.
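As a rough way to see what a given platform reports, the incubating Vector API can be queried for the preferred vector width. This is only an illustrative standalone sketch (the class name is mine), not Trino's actual code:

// Compile and run with: --add-modules jdk.incubator.vector (the Vector API is still incubating)
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class PreferredVectorWidth {
    public static void main(String[] args) {
        // Preferred vector species for int lanes on the current platform
        VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
        int bits = species.vectorBitSize();
        // Per the PR description, vectorized decoding is used only when this is at least 256 bits
        System.out.println("Preferred vector bit size: " + bits);
        System.out.println("Vectorized decoding would be " + (bits >= 256 ? "enabled" : "disabled"));
    }
}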

Our Trino deployment environment info (the setup where the worker container crashes occurred)

Suggestion

As I am not an expert in hardware and CPUs, further investigation into this issue is difficult for me. However, it appears that this feature may not yet be stable (it relies on the Java Vector API, which is still in the incubator stage). I propose temporarily changing the default value of parquet.experimental.vectorized-decoding.enabled to false until the feature stabilizes enough to drop the experimental label. This would help prevent other Trino users from encountering issues similar to ours.

Related PRs

Related Issues

cc/ @raunaqmorarka

raunaqmorarka commented 1 month ago

@okayhooni would it be possible for you to provide a JVM crash dump, if this was due to a JVM crash? Also, which version of the JVM are you using? If there isn't a JVM crash, but just high CPU followed by a pod crash, then a JFR profile would help to see which code path specifically is the problem.
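In case it helps, something like the following in jvm.config should capture both a crash log and a JFR recording. The paths below are just placeholders and need a writable location that survives the pod restart.

# Write an hs_err crash log if the JVM itself crashes
-XX:ErrorFile=/var/trino/dumps/hs_err_pid%p.log
# Continuous JFR recording with profiling settings, dumped when the JVM exits
-XX:StartFlightRecording=disk=true,dumponexit=true,maxsize=512M,settings=profile,filename=/var/trino/dumps/worker.jfr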

okayhooni commented 1 month ago

Thank you for the quick answer! We use Temurin JDK 22 (Temurin-22.0.1+8).

Currently, we have not mounted any persistent volume to the worker pods for keeping JVM crash artifacts such as thread dumps; the only exception is the heap dump file written with the HeapDumpOnOutOfMemoryError JVM option. But this issue is not related to OOM, so there is no JVM heap dump.

I suspect this issue isn't caused by a JVM crash but rather a CONTAINER crash. I've observed logs like the ones below, emitted after the graceful shutdown API is invoked by the preStop container lifecycle hook of our Trino worker pod, before the container restarts.

INFO    http-worker-2253    io.trino.server.GracefulShutdownHandler Shutdown requested
...
INFO    shutdown-handler-0  io.trino.server.GracefulShutdownHandler Waiting for all tasks to finish
...
WARN    shutdown-handler-0  io.trino.server.GracefulShutdownHandler Timed out waiting for the life cycle to stop
...
INFO    LifeCycleManager Shutdown Hook  io.airlift.bootstrap.LifeCycleManager   JVM is shutting down, cleaning up

I'm afraid I can't promptly share a JFR profile captured with the StartFlightRecording=event=cpu-load JVM option, due to other work with higher priority, but I will!
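(If it's easier, I think a recording could also be started on an already-running worker with jcmd instead of restarting with new JVM options; the pod name, pid, and paths below are placeholders.)

kubectl exec <worker-pod> -- jcmd <trino-pid> JFR.start name=cpu-surge settings=profile duration=300s filename=/tmp/cpu-surge.jfr
kubectl cp <worker-pod>:/tmp/cpu-surge.jfr ./cpu-surge.jfr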

okayhooni commented 1 month ago

I am afraid this issue might not be related to Parquet vectorized decoding after all, as it is not consistently reproducible in a deterministic way, even under the same query replay conditions.

Sometimes the issue does NOT occur in the staging environment, which has only Graviton nodes, under the same replay conditions.

However, I am unable to reproduce this issue when testing on x86 nodes or with the configuration parquet.experimental.vectorized-decoding=false.

okayhooni commented 1 month ago

Sorry... I found that this issue was reproduced even with parquet.experimental.vectorized-decoding=false on pods running on Graviton nodes.

It may be caused by some other reason.

okayhooni commented 1 month ago

@raunaqmorarka

I found that this issue was just another side effect of the thread-per-driver scheduler, already reported in issue #21512.

When I disabled experimental.thread-per-driver-scheduler-enabled, the issue went away!
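For anyone else hitting this, the workaround we applied is just the following config change (a sketch; we set it in config.properties on the workers, property name exactly as above):

# etc/config.properties
experimental.thread-per-driver-scheduler-enabled=false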