prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.04k stars 5.38k forks source link

Aggregation with structural type causes GC pressure and excessive cross region reference #9553

Closed wenleix closed 2 years ago

wenleix commented 6 years ago

We have seen the following two issues in production when performing aggregations with structural type with huge number of groups. There are two separate cases:

wenleix commented 6 years ago

For array_agg, one idea is store one huge flattened ArrayBlockBuilder for all groups, and we don't have per group ArrayBlockBuilder. We need to have a linked list for each group to construct the aggregated array per group. This is similar to the histogram aggregation improvement in https://github.com/prestodb/presto/pull/8918.

This idea should also be applicable to the following aggregation functions: map_agg, map_unoin and multimap_agg

wenleix commented 6 years ago

Similarly, for ARBITRARY , one idea is to store a huge flattened ArrayBlockBuilder for all groups. Another integer array needs to be maintained for the mapping from group id to the position in the ArrayBlockBuilder. Since for ARBITRARY, element only doesn't need to be updated per group, we don't need to handle update case.

While MIN_BY and MAX_BY can also use this method to store the flattened array for all groups, they need to handle updates for group. One way would be to always append to the huge flattened ArrayBlockBuilder and update the mapping from group id to position. However, this requires periodic compaction when the ArrayBlockBuilder wastes too much memory.

wenleix commented 6 years ago

Per discussion with @highker and @haozhun . There is an easier solution for ARBITRARY, MIN_BY, MAX_BY.

The issue is currently we use BlockState to store the aggregation state, which stores a Block for a structural type. The SingleMapBlock are crated via type.getObject(block, position)

https://github.com/prestodb/presto/blob/9a42ff18618d93e33a34fe9f2cdc173e8838ac95/presto-main/src/main/java/com/facebook/presto/operator/aggregation/ArbitraryAggregationFunction.java#L201-L207

Instead of slice the large block into SingleMapBlock, we should store the parent block and the position into BlockState. The same applies to Slice as well.

In a long term, we might want to be able to perform automatic compaction.

wenleix commented 6 years ago

Update: We have been able to reproduce the issue by a query contains a lot of array_agg and billions of groups. The symptom is we see excessive native memory usage (up to 20G). We find this is due to Remember Set with Native Memory Tracking enabled:

[0x00007f6340c1ea05] ArrayAllocator<unsigned long, (MemoryType)7>::allocate(unsigned long)+0x175
[0x00007f6340c1d2eb] BitMap::resize(unsigned long, bool)+0x6b
[0x00007f6340f183aa] OtherRegionsTable::add_reference(void*, int)+0x3ea
[0x00007f63411f1416] ObjArrayKlass::oop_oop_iterate_nv_m(oopDesc*, FilterOutOfRegionClosure*, MemRegion)+0x186
                             (malloc=7045856KB +7045856KB #880732 +880732)

[0x00007f6340c1ea05] ArrayAllocator<unsigned long, (MemoryType)7>::allocate(unsigned long)+0x175
[0x00007f6340c1d2eb] BitMap::resize(unsigned long, bool)+0x6b
[0x00007f6340f183aa] OtherRegionsTable::add_reference(void*, int)+0x3ea
[0x00007f6340f357ff] InstanceKlass::oop_oop_iterate_nv(oopDesc*, FilterOutOfRegionClosure*)+0x12f
                             (malloc=10433488KB +10433488KB #1304186 +1304186)

@nezihyigitbasi and @highker helped to find mitigations by tuning G1RSetSparseRegionEntries, as illustrated in https://medium.com/@milan.mimica/everybody-leaks-f210631f13ef . It can lower the native memory usage down to around 10G.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.