prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Slow InMemoryJoinHash construction #4567

Closed pnowojski closed 7 years ago

pnowojski commented 8 years ago

While tuning settings for a query that joins two relatively large tables of similar size, we noticed very poor broadcast join performance. The query was painfully slow (around 5 minutes), and most of that time was spent with little to no CPU usage on the worker nodes. I tracked the performance issue down to the construction of InMemoryJoinHash, and I was able to speed the query up (to around 2m20s) by running `SET SESSION task_hash_build_concurrency=4;`. As expected, lots of time was still spent constructing InMemoryJoinHash, but now with 4 of the 32 cores on the worker nodes doing the work. I was not able to increase the concurrency further, because with a higher value the query failed with an out-of-memory error.

I think we could and should optimize this step a little bit.

CC @losipiuk @kokosing

pnowojski commented 8 years ago

After spending some time optimizing InMemoryJoinHash, I see at least two more changes that could further improve join performance. One is to create a specialized version of InMemoryJoinHash for single-column joins on the Java long type, just as is done for aggregations - we could skip hashing that single column entirely.
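A minimal sketch of the idea (hypothetical class, not Presto's actual code): when the single join key is already a long, it can be used directly as the map key, with no generic per-row hashing of the column. A real implementation would use a primitive-specialized map (e.g. fastutil's Long2IntOpenHashMap) to avoid boxing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a join hash specialized for one BIGINT join key.
class SingleLongJoinHash {
    private final Map<Long, Integer> keyToPosition = new HashMap<>();

    // Records the build-side position for a join key (first occurrence wins).
    void addRow(long joinKey, int position) {
        keyToPosition.putIfAbsent(joinKey, position);
    }

    // Returns the build-side position for a probe key, or -1 if absent.
    int lookup(long probeKey) {
        return keyToPosition.getOrDefault(probeKey, -1);
    }
}
```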

Secondly, we could think about avoiding hashing rows two or three times. Currently, hashes are recalculated over and over, using different hashing algorithms for different partitioning/bucketing in different places. Correct me if I'm wrong, but for a simple join of two tables, for example, rows are hashed three times:

  1. First, rawHash is calculated using HashGenerator.hashPosition to distribute data between nodes.
  2. rawHash is hashed again in ParallelHashBuilder using murmurHash3 to distribute data within a node between parallel instances of InMemoryJoinHash.
  3. InMemoryJoinHash calculates yet another hash by hashing rawHash using XxHash64.
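The three steps above could be sketched roughly as below. The hash functions here are simple stand-in mixers, not Presto's actual HashGenerator, murmur3, or XxHash64 implementations; the point is only that each stage rehashes the output of the previous one.

```java
// Illustrative sketch of the three redundant hash computations.
class TripleHashDemo {
    // (1) Stand-in for HashGenerator.hashPosition over the join columns,
    //     used to distribute rows between nodes.
    static long rawHash(long joinKey) {
        return Long.rotateLeft(joinKey * 0x9E3779B97F4A7C15L, 31);
    }

    // (2) Stand-in for the murmur-style rehash used to pick one of the
    //     parallel InMemoryJoinHash instances within a node.
    static int parallelIndex(long rawHash, int concurrency) {
        long h = rawHash ^ (rawHash >>> 33);
        h *= 0xFF51AFD7ED558CCDL;
        return (int) Math.floorMod(h, (long) concurrency);
    }

    // (3) Stand-in for the XxHash64 rehash that picks the hash-table slot.
    static long tableHash(long rawHash) {
        long h = rawHash * 0xC2B2AE3D27D4EB4FL;
        return h ^ (h >>> 29);
    }
}
```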

I imagine that if there is also some aggregation after the join, yet another hash is calculated?

Maybe it would be more efficient to calculate one larger hash (64 bits?) and divide its entropy between operators? For example, ParallelHashBuilder could use the first 6 bits of the hash, InMemoryJoinHash the next 26 bits, and aggregation the remaining bits. I picked those values because 2^6 seems a reasonable maximum for task_hash_build_concurrency, and InMemoryJoinHash already cannot handle more than 2^26 values (a limitation of the fastutil library). However, those values could also be calculated dynamically.
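The bit split could look something like the sketch below, using the illustrative widths suggested above (6 bits for parallel instance selection, 26 bits for the join hash table, the low 32 bits for aggregation). The class and method names are hypothetical, and in practice the widths could be chosen dynamically.

```java
// Hypothetical sketch: slicing one 64-bit hash's entropy between operators.
class HashEntropySplit {
    // Top 6 bits: selects one of up to 2^6 = 64 parallel build instances.
    static int parallelBits(long hash) {
        return (int) (hash >>> 58);
    }

    // Next 26 bits: slot within InMemoryJoinHash (fastutil's 2^26 limit).
    static int joinTableBits(long hash) {
        return (int) ((hash >>> 32) & ((1 << 26) - 1));
    }

    // Remaining low 32 bits: left for a downstream aggregation.
    static int aggregationBits(long hash) {
        return (int) hash;
    }
}
```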

CC @losipiuk @kokosing @dain