oap-project / gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Apache License 2.0

Running the TPC-DS power test with Gazelle Plugin enabled on Google Cloud Dataproc yields much lower performance than Dataproc Spark. #424

Open HongW2019 opened 3 years ago

HongW2019 commented 3 years ago

Describe the bug

We've successfully run TPC-DS with Gazelle Plugin on a Google Cloud Dataproc cluster. We chose a cluster with 1 master and 2 workers; each worker has 4 vCores and 15GB DRAM. The data scale is 2GB. We found that the performance of Gazelle Plugin is much lower than Dataproc Spark. With HDFS as storage, the execution times of some queries are shown below:

| Query | Dataproc Spark | Gazelle Plugin |
| --- | --- | --- |
| q78 | 1.2 min | 2.1 min |
| q9 | 1.1 min | 2.5 min |
| q1-q99 (total) | 2251 s | 5046 s |

and the relevant configuration:

spark.executor.extraLibraryPath            /opt/benchmark-tools/oap/lib
spark.executorEnv.LD_LIBRARY_PATH          /opt/benchmark-tools/oap/lib
spark.executorEnv.LIBARROW_DIR             /opt/benchmark-tools/oap
spark.executorEnv.ARROW_LIBHDFS3_DIR       /opt/benchmark-tools/oap/lib
spark.executorEnv.CC                       /opt/benchmark-tools/oap/bin/gcc
spark.driver.extraLibraryPath              /opt/benchmark-tools/oap/lib
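
(For reference, the properties above only wire up the native libraries; the plugin itself is enabled via additional properties not shown here. A sketch of what is typically also required, with class names taken from the Gazelle Plugin README and an assumed off-heap size as a placeholder, so verify both against the version in use:)

spark.sql.extensions                       com.intel.oap.ColumnarPlugin
spark.shuffle.manager                      org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled               true
spark.memory.offHeap.size                  8g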

After checking the detailed execution times and query plans, we find that the table-scan stages take much longer with Gazelle Plugin than with Dataproc Spark.

q9 on Dataproc Spark (screenshot)

q9 on Gazelle Plugin (screenshot)
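
To isolate the scan cost from the rest of the query, a minimal micro-benchmark can be run once with the plugin enabled and once on vanilla Dataproc Spark (a sketch; the ScanBench name and the hdfs:///tpcds path are hypothetical):

import org.apache.spark.sql.SparkSession

object ScanBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scan-bench").getOrCreate()
    val start = System.nanoTime()
    // Summing a real column forces the data pages to be read and decoded;
    // a plain count() could be answered from Parquet metadata alone.
    val result = spark.read.parquet("hdfs:///tpcds/store_sales")
      .selectExpr("sum(ss_quantity)").collect()(0)
    val secs = (System.nanoTime() - start) / 1e9
    println(f"sum = $result, scan took $secs%.1f s")
    spark.stop()
  }
}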

HongW2019 commented 3 years ago

@weiting-chen @zhouyuan @zhixingheyi-tian Please help to track the issue. Thanks!

zwx109473 commented 3 years ago

I hit the same problem when testing with 10GB of TPC-DS data: with the plugin enabled, queries run 2-3x slower, and most of the time is spent in ColumnarBatchScan. Could you explain why?
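
One quick way to confirm which scan operator the plan actually uses (a sketch; the table path is hypothetical) is to print the physical plan and check whether the scan node is the plugin's columnar scan rather than vanilla FileScan parquet:

// With Gazelle enabled, the physical plan should show a columnar/Arrow-based
// scan node; if it still shows "FileScan parquet", the plugin is not active.
spark.read.parquet("hdfs:///tpcds/store_sales").explain()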

zhztheplayer commented 3 years ago

Although it is hard to say why Arrow Data Source looks slower than Parquet Data Source without diving deeply into the execution process, one major difference between the two implementations is string-dictionary handling. Vanilla Spark's Parquet Data Source can read dictionary-encoded data directly without decoding the dictionaries, whereas Arrow Data Source always decodes them. IIRC this has been the cause of most performance gaps similar to the one observed here.

If that is the cause, then re-running the test with non-dictionary-encoded data should produce results closer to the actual reading performance of the two Data Sources.
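
For example, a table could be re-written with Parquet dictionary encoding disabled before re-running the comparison (a sketch; the paths are hypothetical, and parquet.enable.dictionary is the standard parquet-mr writer switch):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-dict-rewrite").getOrCreate()

// Re-write store_sales with dictionary encoding turned off, so that both
// Data Sources read plain-encoded pages and the dictionary-decoding
// difference is taken out of the comparison.
spark.read
  .parquet("hdfs:///tpcds/store_sales")
  .write
  .option("parquet.enable.dictionary", "false")
  .parquet("hdfs:///tpcds-nodict/store_sales")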