Open sopel39 opened 5 months ago
With "waiting for splits" approach we've managed to get it down to 10s
for 2 table scans, so 16s -> 10s
(no cache/cache) and also I’ve observed no rejected cache splits
The bigger query also ends up with 10,86s
of wall time, so pretty efficient
The test query:
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
union all
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
union all
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
union all
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
union all
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
union all
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
union all
select sum(orderkey / 10000), sum(partkey / 10000), sum(suppkey / 10000), sum(linenumber), sum(quantity) from hive.tpch_sf1000_dec_snappy_parquet_part.lineitem
Subquery
s1
ands2
might have common subquery extracted. However,s1
ands2
might have limited number of common splits. Therefore it doesn’t make sense for stages1
to wait for stages2
to be completed (and vice-versa). It makes more sense to have some kind of smart logic to order execution of common splits so that split with id is not processed at the same time bys1
ands2
. This could maybe be achieved by improvingCacheSplitSource
so that it’s aware of split sourcess1
ands2
and orders split execution in smart way.