trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Trino query hangs after reading 400~600M rows (10~12GB) of Parquet data from object storage #21905

Closed: pavan-upadhyay closed this issue 5 months ago

pavan-upadhyay commented 5 months ago

We have 5 worker nodes (8 vCPU/32GB) and a coordinator (6 vCPU/32GB) in a cluster running Trino 389.

We are trying to run a query that fetches records from an unpartitioned table containing 3B records, and the query hangs after reading 400~600M rows (10~12GB of data). Below is the worker-node log after enabling DEBUG in the Trino log properties:

2024-05-09T19:45:22.093Z INFO 20240509_194405_00003_p8yig.5.0.0-131-58 io.trino.plugin.hive.oci.OciConfigurationProvider Updating extra credentials in the configuration.
2024-05-09T19:45:22.093Z INFO 20240509_194405_00003_p8yig.5.0.0-131-58 com.oracle.bmc.hdfs.store.BmcDataStore Getting metadata for key: 'Data/sch/l0/gl_master_parquet/42495e11a8b53e3b-af7c1a4800000000_282281672_data.59.parq'
2024-05-09T19:45:22.096Z DEBUG 20240509_194405_00003_p8yig.5.0.0-94-55 io.trino.execution.executor.TaskExecutor Split 20240509_194405_00003_p8yig.5.0.0-94 {path=oci://test@tenancy/data_parquet/42495e11a8b53e3b-af7c1a4800000000_282281672_data.381.parq, start=0, length=67108864, estimatedFileSize=251381904, hosts=[], database=rlsf_prod, table=gl_master_parquet, forceLocalScheduling=false, partitionName=, deserializerClassName=org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, s3SelectPushdownEnabled=false, splitNumber=407} (start = 2.527843739399534E9, wall = 7784 ms, cpu = 1580 ms, wait = 1162 ms, calls = 6) is finished
2024-05-09T19:45:22.098Z INFO 20240509_194405_00003_p8yig.5.0.0-132-55 io.trino.plugin.hive.oci.OciConfigurationProvider Updating extra credentials in the configuration.
2024-05-09T19:45:22.098Z INFO 20240509_194405_00003_p8yig.5.0.0-132-55 com.oracle.bmc.hdfs.store.BmcDataStore Getting metadata for key: 'Data/sch/l0/gl_master_parquet/42495e11a8b53e3b-af7c1a4800000000_282281672_data.179.parq'
2024-05-09T19:45:33.083Z DEBUG task-notification-2 io.trino.execution.TaskStateMachine Task 20240509_194405_00003_p8yig.5.0.0 is ABORTED
2024-05-09T19:45:33.085Z WARN 20240509_194405_00003_p8yig.5.0.0-86-76 com.oracle.bmc.hdfs.store.BmcParallelReadAheadFSInputStream Read operation interrupted, retrying...
2024-05-09T19:45:33.086Z DEBUG task-notification-4 io.trino.execution.TaskStateMachine Task 20240509_194405_00003_p8yig.3.2.0 is ABORTED

Below are the JVM configs:

-server
-Xmx25G
-XX:-UseBiasedLocking
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-Dlogback.configurationFile=/etc/trino/conf/trino-ranger-plugin-logback.xml

Trino config below:

query.max-memory=25GB
task.max-worker-threads=32
http-server.http.port=8285
query.max-concurrent-queries=100
node-scheduler.include-coordinator=false
query.execution-policy=phased
query.client.timeout=10m
http-server.https.enabled=false
query.max-memory-per-node=5GB
task.concurrency=8
coordinator=true

Could you please advise what additional configuration is needed, or share any clues about the existing settings? Thanks in advance.

tlm365 commented 5 months ago

@pavan-upadhyay I suggest increasing the query.max-memory-per-node config to 15GB or more as an experiment. Because your table is unpartitioned, it may need to be loaded onto one node instead of being spread across the 5 nodes.
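For reference, a minimal sketch of what that change could look like in each node's etc/config.properties (15GB is only a starting point to experiment with, not a verified value):

# per-node query memory limit; must stay well below the JVM heap (-Xmx25G)
query.max-memory-per-node=15GB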

However, for the long term I still recommend rewriting this table with an appropriate partitioning scheme.
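A rough sketch of such a rewrite via CTAS, assuming a Hive catalog named hive, the rlsf_prod schema seen in the logs, and a hypothetical txn_date partition column (the Hive connector requires partition columns to come last in the column list):

CREATE TABLE hive.rlsf_prod.gl_master_parquet_part
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['txn_date']
)
AS
SELECT account_id, amount, txn_date -- hypothetical columns; partition column last
FROM hive.rlsf_prod.gl_master_parquet;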

Edit: Don't forget to SELECT only the necessary columns instead of SELECT *, which is also a simple and effective improvement.
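For example (column names hypothetical): Parquet is columnar, so an explicit column list lets Trino skip reading the other columns entirely.

-- reads only the two listed columns from the Parquet files
SELECT account_id, amount FROM gl_master_parquet;
-- instead of: SELECT * FROM gl_master_parquet;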

pavan-upadhyay commented 5 months ago

Hi @tlm365, thanks for the response! I did try -Xmx28G and query.max-memory-per-node=18G, but the issue remains. My query does list column names in the SELECT instead of using SELECT *. The query I am testing in Trino works fine in Impala with the same cluster configuration.

pavan-upadhyay commented 5 months ago

Hi @tlm365, this time I set enough memory, including the JVM heap; however, it still hangs after 600M rows, and our reserved memory drops from 19 GB to 100 MB.

[four screenshots attached]
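One way to check whether the query is still making progress while it appears hung is to query Trino's built-in system catalog (a sketch; the query_id below is the one from the logs above, and column availability may vary slightly by version):

-- currently running queries
SELECT query_id, state, query
FROM system.runtime.queries
WHERE state = 'RUNNING';

-- per-task split progress for the hung query
SELECT task_id, state, queued_splits, running_splits, completed_splits
FROM system.runtime.tasks
WHERE query_id = '20240509_194405_00003_p8yig';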