Open HongW2019 opened 3 years ago
@weiting-chen @zhouyuan @zhixingheyi-tian Please help to track the issue. Thanks!
I also tested with 10 GB of TPC-DS data and hit the same problem: with the plugin enabled, query speed drops by 2-3x, and most of the time is spent in ColumnarBatchScan. Could you explain why?
It is hard to say definitively why the Arrow Data Source looks slower than the Parquet Data Source without diving deeply into the execution process, but one major difference between the two implementations is string dictionary handling. In vanilla Spark, the Parquet Data Source can read dictionary-encoded data directly without decoding the dictionaries, whereas the Arrow Data Source always decodes them. IIRC, this has been the cause of most of the performance gaps similar to the one we used to observe.
If that is the reason, then re-running on non-dictionary data should give results that are closer to the actual reading performance of the two Data Sources; a rough sketch of how to do that is below.
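To try this, one possibility (just a sketch, not how the original benchmark was driven) is to rewrite a TPC-DS table with Parquet dictionary encoding disabled and re-run the same scan under both data sources. The paths and the `store_sales` table below are placeholders; `parquet.enable.dictionary` is the standard parquet-mr writer property.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Disable dictionary encoding for Parquet files written by this session.
// "parquet.enable.dictionary" is the standard parquet-mr writer property.
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")

// Placeholder paths; substitute the real TPC-DS table locations on HDFS.
val srcPath = "hdfs:///tpcds/store_sales"
val dstPath = "hdfs:///tpcds_nodict/store_sales"

// Rewrite the table without dictionary pages, then run the same query against
// dstPath once with the Arrow data source and once with the vanilla Parquet
// data source, and compare the scan-stage times.
spark.read.parquet(srcPath)
  .write
  .mode("overwrite")
  .parquet(dstPath)
```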
Describe the bug
We've successfully run TPC-DS with the Gazelle Plugin on a Google Cloud Dataproc cluster. The cluster has 1 master and 2 workers; each worker has 4 vCores and 15 GB of DRAM. The data scale is 2 GB. We found that the performance of the Gazelle Plugin is much lower than that of Dataproc Spark. With HDFS as storage, the execution times of some queries are shown below:
and some of the configs:
After checking the detailed execution times and query plans, we find that the table-scanning stages under the Gazelle Plugin take much longer than under Dataproc Spark; see the plans for q9 below, followed by a sketch of how they can be inspected.
q9 on Dataproc Spark
q9 on Gazelle Plugin
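As a rough sketch of how the two plans can be compared (the `queries/q9.sql` path below is a placeholder for wherever the TPC-DS query text lives in the benchmark harness):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Placeholder: load the TPC-DS q9 text from a file; substitute your own query source.
val q9Sql = scala.io.Source.fromFile("queries/q9.sql").mkString

val q9 = spark.sql(q9Sql)
q9.explain("formatted") // shows whether the plan uses the columnar/Arrow scan or a vanilla FileScan parquet
q9.collect()            // run the query so the scan stages show up with timings in the Spark UI
```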