prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.08k stars 5.39k forks source link

Why does 'order by' need more memory than entire dataset size? #13846

Open tooptoop4 opened 4 years ago

tooptoop4 commented 4 years ago

Got below error after running 'select * from hive.redact order by time_stamp desc' Data is hudi parquet (hive - STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat' ) on s3.

But entire dataset is only 3.5GB (85 columns wide), why does OrderByOperator need > 26GB ??

com.facebook.presto.ExceededMemoryLimitException: Query exceeded per-node user memory limit of 27GB [Allocated: 27.00GB, Delta: 1.22MB, Top Consumers: {OrderByOperator=26.98GB, MergeSortedPages=17.09MB, ScanFilterAndProjectOperator=146.26kB}] at com.facebook.presto.ExceededMemoryLimitException.exceededLocalUserMemoryLimit(ExceededMemoryLimitException.java:40) at com.facebook.presto.memory.QueryContext.enforceUserMemoryLimit(QueryContext.java:329) at com.facebook.presto.memory.QueryContext.updateUserMemory(QueryContext.java:136) at com.facebook.presto.memory.QueryContext$QueryMemoryReservationHandler.reserveMemory(QueryContext.java:310) at com.facebook.presto.memory.context.RootAggregatedMemoryContext.updateBytes(RootAggregatedMemoryContext.java:37) at com.facebook.presto.memory.context.ChildAggregatedMemoryContext.updateBytes(ChildAggregatedMemoryContext.java:38) at com.facebook.presto.memory.context.ChildAggregatedMemoryContext.updateBytes(ChildAggregatedMemoryContext.java:38) at com.facebook.presto.memory.context.ChildAggregatedMemoryContext.updateBytes(ChildAggregatedMemoryContext.java:38) at com.facebook.presto.memory.context.ChildAggregatedMemoryContext.updateBytes(ChildAggregatedMemoryContext.java:38) at com.facebook.presto.memory.context.SimpleLocalMemoryContext.setBytes(SimpleLocalMemoryContext.java:66) at com.facebook.presto.operator.OperatorContext$InternalLocalMemoryContext.setBytes(OperatorContext.java:598) at com.facebook.presto.util.MergeSortedPages.lambda$pageWithPositions$4(MergeSortedPages.java:148) at com.facebook.presto.operator.WorkProcessorUtils.lambda$null$1(WorkProcessorUtils.java:190) at java.util.Optional.map(Optional.java:215) at com.facebook.presto.operator.WorkProcessorUtils.lambda$flatMap$2(WorkProcessorUtils.java:190) at com.facebook.presto.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:262) at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315) at com.facebook.presto.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:249) at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315) at com.facebook.presto.operator.WorkProcessorUtils$2.process(WorkProcessorUtils.java:124) at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315) at com.facebook.presto.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:249) at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315) at com.facebook.presto.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:249) at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315) at com.facebook.presto.operator.exchange.LocalMergeSourceOperator.getOutput(LocalMergeSourceOperator.java:164) at com.facebook.presto.operator.Driver.processInternal(Driver.java:379) at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283) at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675) at com.facebook.presto.operator.Driver.processFor(Driver.java:276) at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077) at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162) at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483) at com.facebook.presto.$gen.Presto_0_220_dirty__0_220____20191211_061749_1.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

image

EXPLAIN ANALYZE VERBOSE output:

Fragment 1 [ROUND_ROBIN] CPU: 59.84s, Scheduled: 1.04m, Input: 4650726 rows (3.17GB); per task: avg.: 4650726.00 std.dev.: 0.00, Output: 4650726 rows (3.17GB) Output layout: [redact_cols] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION

Fragment 2 [SOURCE] CPU: 1.40m, Scheduled: 1.56m, Input: 4650726 rows (3.66GB); per task: avg.: 4650726.00 std.dev.: 0.00, Output: 4650726 rows (3.17GB) Output layout: [redact_cols] Output partitioning: ROUND_ROBIN [] Stage Execution Strategy: UNGROUPED_EXECUTION

arhimondr commented 4 years ago

@tooptoop4 Could you please share the entire EXPLAIN ANALYZE output (including the Fragment 0)?

tooptoop4 commented 4 years ago

@arhimondr that is the entire EXPLAIN ANALYZE output, there is no Fragment 0!

tooptoop4 commented 4 years ago

@vinothchandar @bhasudha are u aware of hudi needing loads of memory?

vinothchandar commented 4 years ago

Presto/Hudi queries perform just any a plain parquet table.. Hudi just plugs during planning phase and helps decide what files to scan.. Other than that, what you see must be a pure presto issue..

arhimondr commented 4 years ago

@tooptoop4 Looks like the Fragment 0 is missing, as it is the fragment that the query failed on. I think this is a bug that was introduced recently (failed fragment is not included in the plan).

The other weird thing (I assume) that for some reason after N way marge the order by operator is applied. It should be either - either, but not both.

tooptoop4 commented 4 years ago

prestoSQL v328 also has the issue: Query exceeded per-node user memory limit of 26GB [Allocated: 26.00GB, Delta: 53.54MB, Top Consumers: {OrderByOperator=20.97GB, MergeSortedPages=5.02GB, GenericSpiller=620kB}]

rubenssoto commented 3 years ago

I dont know if this issue was solved on prestodb, but was already solved on trino, the same solution could be applied here?

https://github.com/trinodb/trino/pull/2612