trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0

GROUP BY column list order affects runtime #5553

Open tooptoop4 opened 3 years ago

tooptoop4 commented 3 years ago

Qry1: --takes 1min (UI json shows "cumulativeUserMemory" : 3.3699831825E11, "totalScheduledTime" : "19.26m","peakTaskTotalMemory" : "57267780B", "info" : { "@type" : "exchangeClientStatus", "bufferedBytes" : 0, "maxBufferedBytes" : 485348, "averageBytesPerRequest" : 12007, "successfulRequestsCount" : 1264, "bufferedPages" : 0, "noMoreLocations" : true, "pageBufferClientStatuses" : [ ] }

SELECT a,b,c,d,e,
sum(value_double) as value FROM s.t
WHERE d BETWEEN TIMESTAMP '2020-03-01' AND TIMESTAMP '2020-08-11'  
AND data_set_name IN ('e','t')
GROUP BY d,b,a,e,c

Qry2: --takes 2mins (UI json shows "cumulativeUserMemory" : 1.058281153416E12, "totalScheduledTime" : "29.80m","peakTaskTotalMemory" : "105336700B", "info" : { "@type" : "exchangeClientStatus", "bufferedBytes" : 0, "maxBufferedBytes" : 485211, "averageBytesPerRequest" : 5506, "successfulRequestsCount" : 2752, "bufferedPages" : 0, "noMoreLocations" : true, "pageBufferClientStatuses" : [ ] }

SELECT a,b,c,d,e,
sum(value_double) as value FROM s.t
WHERE d BETWEEN TIMESTAMP '2020-03-01' AND TIMESTAMP '2020-08-11'  
AND data_set_name IN ('e','t')
GROUP BY a,b,c,d,e
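For context: GROUP BY d,b,a,e,c and GROUP BY a,b,c,d,e define exactly the same groups, so hash-based aggregation should, in principle, be insensitive to the column order. A minimal sketch (hypothetical data, not Trino code) of that expectation:

```java
import java.util.*;

public class GroupOrderDemo {
    public static void main(String[] args) {
        // hypothetical rows of (a, b) pairs
        int[][] rows = {{1, 2}, {1, 2}, {3, 4}};
        Map<List<Integer>, Integer> abGroups = new HashMap<>();
        Map<List<Integer>, Integer> baGroups = new HashMap<>();
        for (int[] r : rows) {
            // group once with key order (a, b), once with (b, a)
            abGroups.merge(List.of(r[0], r[1]), 1, Integer::sum);
            baGroups.merge(List.of(r[1], r[0]), 1, Integer::sum);
        }
        // same number of groups either way: key-column order is a detail of
        // how the hash is computed, not of the grouping semantics
        System.out.println(abGroups.size() + " == " + baGroups.size());
    }
}
```

The runtime difference reported here must therefore come from the physical hashing, which is what the thread goes on to investigate.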

from slack: "Tao Wang Apr 3rd at 3:58 AM: 'It's possible to improve the performance of the GROUP BY function by carefully ordering a list of fields within the GROUP BY in an order of high cardinality.' I am a little confused about this tip. Based on my understanding, since GROUP BY is implemented with hashing in Presto, there should be no difference between different group-by orders. Could someone shed light on this? Thanks! (15 replies)

findepi 6 months ago Where this quote is from?

raunaqm 6 months ago I think this is mentioned by Athena and Treasure Data in their tuning guides: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ and https://support.treasuredata.com/hc/en-us/articles/360001450908-Presto-Performance-Tuning

findepi 6 months ago I don’t know why they wrote that. Maybe it was based on some old version? @Tao Wang in general this certainly doesn’t matter — as you pointed out, everything is hash-based. In a less general case, when the data is already partitioned by a subset of the columns, we do not repartition it, but that doesn’t depend on the order of columns in GROUP BY either.

Tao Wang 6 months ago Thanks @findepi and @raunaqm! I did see this quote from Arm Treasure Data. The reason I have this question is that I see a large performance difference with a different order. Below are the same queries with a different GROUP BY order. (edited)

Tao Wang 6 months ago -- takes about 10+ minutes
select player_id_type, game_id_type, platform, release_type, source_type, country, mode_type, game_mode, game_type, game_difficulty, game_map, is_spender, is_new_player, subscription_type, event_source,
cast(udf_enhanced_approx_set(player_id, cast(0.0115 as double)) as varbinary) as active_users_sketch,
sum(event_count) as event_count, sum(session_duration_sum) as session_duration_sum,
dt, pgroup, game_id
from ${db}.${table} where dt='2020-03-16'
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,19,20,21;

Tao Wang 6 months ago -- takes less than 3 minutes
select player_id_type, game_id_type, platform, release_type, source_type, country, mode_type, game_mode, game_type, game_difficulty, game_map, is_spender, is_new_player, subscription_type, event_source,
cast(udf_enhanced_approx_set(player_id, cast(0.0115 as double)) as varbinary) as active_users_sketch,
sum(event_count) as event_count, sum(session_duration_sum) as session_duration_sum,
dt, pgroup, game_id
from ${db}.${table} where dt='2020-03-16'
group by 13,1,2,3,4,5,6,7,8,9,10,11,12,14,15,19,20,21;

findepi 6 months ago can you compare the EXPLAIN plans for the different orders and check what the distribution is (which columns are hashed) of the individual fragments?

findepi 6 months ago since there is only TS + GROUP BY, there shouldn’t be any edge cases. but the difference could be attributed to equals invocations when finding values within a bucket in memory (let me check) (edited)

Tao Wang 6 months ago Is it possibly related to hash collisions?

findepi 6 months ago https://github.com/prestosql/presto/blob/fd5f4988df52dd49be6e95b681c0b87481ef4e4f/presto-main/src/main/java/io/prestosql/operator/MultiChannelGroupByHash.java#L286 presto-main/src/main/java/io/prestosql/operator/MultiChannelGroupByHash.java:286 if (positionNotDistinctFromCurrentRow(groupAddressByHash[hashPosition], hashPosition, position, page, (byte) rawHash, channels)) {

findepi 6 months ago it’s not only about hash collisions per se (they are rare if the values are unequal); it’s hash % size, so effective collisions are more frequent, and those should be filtered out by the rawHashByHashPosition[hashPosition] != rawHash check in that method… (edited)
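To illustrate the point above (a minimal sketch with made-up numbers, not Trino source): two distinct raw hashes can still land in the same bucket once reduced modulo the hash-table size, producing an "effective" collision that the raw-hash comparison and equals checks then have to filter out.

```java
public class EffectiveCollisionDemo {
    public static void main(String[] args) {
        int hashTableSize = 1024;           // hypothetical power-of-two table size
        long rawHash1 = 5;
        long rawHash2 = 5 + hashTableSize;  // a different raw hash...
        int bucket1 = (int) (rawHash1 % hashTableSize);
        int bucket2 = (int) (rawHash2 % hashTableSize);
        // ...but the same bucket after hash % size: an effective collision
        System.out.println(bucket1 == bucket2);
    }
}
```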

findepi 6 months ago I think there is something i am not seeing. @dain do you recall?

Tao Wang 6 months ago Thanks @findepi! The EXPLAIN plans are pretty much the same except for the group-by column order. Looking forward to dain's reply. Thanks all.

findepi 6 months ago the same group-by columns, just in a different order, right?

Tao Wang 6 months ago yes, you are right. just in a different order "

ssquan commented 3 years ago

Maybe it is because of the combine_hash function used when calculating the hash of the columns in the GROUP BY clause. According to the source code, the hash value may overflow after the previous hash value is multiplied by 31.

    @ScalarFunction(value = "combine_hash", hidden = true)
    @SqlType(StandardTypes.BIGINT)
    public static long getHash(@SqlType(StandardTypes.BIGINT) long previousHashValue, @SqlType(StandardTypes.BIGINT) long value)
    {
        return (31 * previousHashValue + value);
    }
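A minimal sketch of why this combiner makes the final hash depend on column order (the per-column hash values 7 and 11 are hypothetical; the `31 * previous + value` formula is the one quoted above):

```java
public class CombineHashOrderDemo {
    // same formula as combine_hash above
    static long combineHash(long previous, long value) {
        return 31 * previous + value;
    }

    public static void main(String[] args) {
        long hashA = 7, hashB = 11; // hypothetical per-column hashes
        // folding the columns in (a, b) order vs (b, a) order
        long ab = combineHash(combineHash(0, hashA), hashB); // 31*7 + 11 = 228
        long ba = combineHash(combineHash(0, hashB), hashA); // 31*11 + 7 = 348
        System.out.println(ab + " vs " + ba);
    }
}
```

So while the groups are identical, the combined hash values (and hence the bucket distribution) differ per GROUP BY order.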
findepi commented 3 years ago

@ssquan good point. I guess this doesn't work well e.g. when the number of worker nodes is a multiple of 31. Other scenarios?
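One way to see this (a sketch under the assumption that the combined hash is reduced modulo a bucket count that is a multiple of 31): the `31 * previous` term vanishes modulo 31, so every column's contribution except the last drops out of the bucket choice.

```java
public class MultipleOf31Demo {
    public static void main(String[] args) {
        long buckets = 31;        // hypothetical bucket/node count, a multiple of 31
        long lastColumnHash = 5;
        // whatever the earlier columns hashed to, (31 * previous) % 31 == 0,
        // so the bucket depends only on the last column's hash
        for (long previous : new long[] {1, 2, 1_000_000}) {
            long combined = 31 * previous + lastColumnHash;
            System.out.println(combined % buckets); // 5 every time
        }
    }
}
```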

sopel39 commented 3 years ago

In EXPLAIN ANALYZE VERBOSE there should be hash collision stats for the aggregation operator. Could you attach such plans?

tooptoop4 commented 3 years ago

Qry1 Collisions avg.: 2532.00 (17882.48% est.), Collisions std.dev.: 0.00%

vs

Qry2 Collisions avg.: 2534.00 (17896.61% est.), Collisions std.dev.: 0.00%

for both: Input: 23401616 rows (1.85GB); per task: avg.: 23401616.00 std.dev.: 0.00, Output: 667 rows (56.08kB)

sopel39 commented 3 years ago

17882.48% is a high number, but it happens in both cases. Can you consistently reproduce the perf difference in both scenarios? Maybe it's mostly executing on a single node, which has high perf variability.

tooptoop4 commented 3 years ago

consistently different numbers of collisions based on the order of the columns