greenlion / warp

WarpSQL Server, an open source OLAP-focused distribution of the world's most popular open source database, bundled with OLAP performance related plugins such as the WARP storage engine.
http://warpsql.blog

Support parallel query and/or vectorized execution (for join operations) #73

Closed — twocode closed this issue 2 years ago

greenlion commented 3 years ago

I just merged in the parallel query branch. This is still an alpha feature. It currently works only with SUM and COUNT and doesn't support positional group by. But you can test it if you want.
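
For context, a query shape that should be eligible under those constraints looks like the sketch below: SUM/COUNT aggregates with the GROUP BY written by column name rather than by ordinal position (the column names here are just illustrative SSB-style names, not taken from a tested query):

-- should be eligible for parallel query: SUM/COUNT only, GROUP BY by column name
SELECT lo_orderdatekey, COUNT(*) AS cnt, SUM(lo_revenue) AS revenue
  FROM lineorder
 GROUP BY lo_orderdatekey;

-- not yet supported: positional GROUP BY
SELECT lo_orderdatekey, COUNT(*), SUM(lo_revenue)
  FROM lineorder
 GROUP BY 1;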

I am going to enable the github wiki page for WARP and add star schema benchmark results from my machine compared to MariaDB ColumnStore. I am working on some bug fixes while I test the benchmark queries for correctness and performance.

greenlion commented 3 years ago

Use the parallel_query branch for now. I mucked the 8.0 branch up slightly and need to fix it.

To use parallel query you must compile the software, run make install, update /etc/ld.so.conf.d/warp to point to the install location's lib/plugin directory, and run /usr/local/mysql/share/warp_parallel_install.sql to load the plugin. You must also enable the event scheduler and set warp_rewriter_parallel_query=ON.
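
A rough sketch of the server-side half of those steps, run from the mysql client after the build and ld.so.conf update (whether warp_rewriter_parallel_query is global or session scoped is an assumption here):

-- load the parallel query plugin definitions shipped with the install
SOURCE /usr/local/mysql/share/warp_parallel_install.sql

-- the event scheduler must be running for parallel query
SET GLOBAL event_scheduler = ON;

-- turn on the parallel query rewrite (GLOBAL scope assumed)
SET GLOBAL warp_rewriter_parallel_query = ON;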

greenlion commented 3 years ago

The 8.0 branch is fixed. It now has parallel query (follow the instructions above) and is updated to 8.0.24.

greenlion commented 3 years ago

https://twitter.com/justin_swanhart/status/1397280065948372997?s=20

greenlion commented 2 years ago

Parallel query no longer requires a plugin. A parallel hash join is now built into the server.

On the parallel_hash_join branch (not merged yet), these are the results for the Star Schema Benchmark at scale factor 200. There are approximately 130GB of data in the schema. This is query 4.1 of the benchmark (modified to project only COUNT(*)), which is generally the most intensive query on the benchmark (its filters remove the fewest rows):

SSB SF=200, WARP_CACHE_SIZE=45GB, SYS_MEMORY=64G, CPU=i7-9750H [6 cores/12 threads], Storage=M.2 NVMe SAMSUNG 860 1TB [formatted to 512GB], Oracle Linux 8.3, WARP 8.0.25

mysql> explain analyze
    -> select straight_join count(*) cnt
    ->   from lineorder
    ->   join dim_date on lo_orderdatekey = d_datekey
    ->   join customer on lo_custkey = c_customerkey
    ->   join supplier on lo_suppkey = s_suppkey
    ->   join part on lo_partkey = p_partkey
    ->  where c_region = 'AMERICA'
    ->    and s_region = 'AMERICA'
    ->    and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (actual time=129414.755..129414.756 rows=1 loops=1)
    -> Inner hash join (part.P_PartKey = lineorder.LO_PartKey)  (cost=770347346895597700000.00 rows=-9223372036854775808) (actual time=122894.335..128662.361 rows=21217772 loops=1)
        -> Filter: ((part.P_MFGR = 'MFGR#1') or (part.P_MFGR = 'MFGR#2'))  (cost=0.02 rows=57476) (actual time=0.033..476.453 rows=479550 loops=1)
            -> Table scan on part  (cost=0.02 rows=1600000) (actual time=0.028..375.251 rows=479550 loops=1)
        -> Hash
            -> Inner hash join (supplier.S_SuppKey = lineorder.LO_SuppKey)  (cost=659007452416762900.00 rows=132489760822319664) (actual time=114594.147..118566.712 rows=21217772 loops=1)
                -> Filter: (supplier.S_Region = 'AMERICA')  (cost=0.00 rows=8094) (actual time=0.030..102.122 rows=79993 loops=1)
                    -> Table scan on supplier  (cost=0.00 rows=400000) (actual time=0.026..84.712 rows=79993 loops=1)
                -> Hash
                    -> Inner hash join (customer.C_CustomerKey = lineorder.LO_CustKey)  (cost=4075897318439544.00 rows=808913619574198) (actual time=107324.619..111349.870 rows=21217772 loops=1)
                        -> Filter: (customer.C_Region = 'AMERICA')  (cost=0.02 rows=119088) (actual time=0.029..753.747 rows=793864 loops=1)
                            -> Table scan on customer  (cost=0.02 rows=6000000) (actual time=0.023..639.769 rows=793864 loops=1)
                        -> Hash
                            -> Inner hash join (dim_date.D_DateKey = lineorder.LO_OrderDateKey)  (cost=342364304601.10 rows=342230416738) (actual time=100704.620..103042.530 rows=21217772 loops=1)
                                -> Table scan on dim_date  (cost=0.00 rows=2556) (actual time=1.530..3.809 rows=2556 loops=1)
                                -> Hash
                                    -> Table scan on lineorder  (cost=133892962.30 rows=1338929623) (actual time=45553.287..98480.088 rows=21217772 loops=1)

1 row in set (2 min 9.90 sec)

The warp_cache_size is set to 45GB. WARP uses 12 threads to execute the query. It is still slower than ColumnStore with the current code, mostly because I have to force MySQL to hash the fact table right now, due to the way the storage engine interface works and due to bugs in MySQL: https://bugs.mysql.com/bug.php?id=105308

Even when the data fits in memory, a single-threaded InnoDB query with a forced hash join ran so long that I killed it after 3 hours. So I think 3 minutes is acceptable for now :1st_place_medal:

greenlion commented 2 years ago

Note this query also shows off the star schema optimization. Notice that only 21M rows are hashed by MySQL for the fact table, and the dimension tables iterate only over rows that match the join. A normal MySQL hash join would have to scan the entire lineorder table serially because there are no indexes that MySQL can see. Bitmap indexes are used for filtering rows, but are not currently used by the join.

Adding BRIN-style indexes (which is fairly trivial for a column store) may be needed to match ColumnStore's performance, along with stopping MySQL from hashing the fact table. Both fixes will be added soon.

greenlion commented 2 years ago

With warp_cache_size=55G the query completes in 1.5 minutes. With warp_cache_size=4GB it takes 3.5 minutes. This is mostly because the join columns have been pushed out of the cache by the time row iteration starts. To improve performance with smaller cache sizes, iteration of a partition should begin as soon as the join finishes for that partition, instead of buffering rownum values until every partition has been processed. That complicates the synchronization logic, though, so it is lower priority: if you have a 130GB data set you can probably afford a machine with a sufficiently large cache, and 3.5 minutes on a 130GB data set is not bad at all compared to regular MySQL, which will be REALLY BAD even with hash join.
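
For anyone experimenting with the cache size, a hedged sketch of checking and changing it (it is an assumption that the variable is dynamic and takes a byte value; it may instead need to be set in my.cnf followed by a restart):

-- inspect the current setting
SHOW GLOBAL VARIABLES LIKE 'warp_cache_size';

-- hypothetical runtime change to 55G, assuming the variable is dynamic and in bytes
SET GLOBAL warp_cache_size = 55 * 1024 * 1024 * 1024;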

greenlion commented 2 years ago

Fixed in the parallel_hash_join branch.