oap-project / gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Apache License 2.0

[ORC] Encounter bitmap out of bound issue in evaluateFilter #557

Open zhixingheyi-tian opened 2 years ago

zhixingheyi-tian commented 2 years ago

Describe the bug
When running TPC-DS integration testing, we encounter the out-of-bound issue below:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 9) (vsr532 executor 1): max_bitmap_index 1920799 must be <= maxSupportedValue 65535 in selection vector
        at org.apache.arrow.gandiva.evaluator.JniWrapper.evaluateFilter(Native Method)
        at org.apache.arrow.gandiva.evaluator.Filter.evaluate(Filter.java:179)
        at org.apache.arrow.gandiva.evaluator.Filter.evaluate(Filter.java:131)
        at com.intel.oap.expression.ColumnarConditionProjector$$anon$1.hasNext(ColumnarConditionProjector.scala:241)
        at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)
        at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:107)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
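For context, the 65535 cap in the error comes from Gandiva's 16-bit selection vector, which records filter matches as uint16 row indices. A minimal illustrative sketch of allocating such a vector through Gandiva's public C++ API (not the plugin's exact code path, which goes through the JNI wrapper shown in the trace):

#include <memory>

#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <gandiva/selection_vector.h>

// A 16-bit selection vector stores matching row positions as uint16 values,
// so the largest row index it can hold is 65535; a batch of 1920800 rows
// (max index 1920799) therefore fails the bound check seen above.
arrow::Status MakeInt16Selection(
    int64_t num_rows, std::shared_ptr<gandiva::SelectionVector>* out) {
  return gandiva::SelectionVector::MakeInt16(
      num_rows, arrow::default_memory_pool(), out);
}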
zhixingheyi-tian commented 2 years ago

By debugging, I have figured out that the cause is in Arrow's file_orc.cc:

Result<RecordBatchIterator> Execute() override {

  ...

  Result<std::shared_ptr<RecordBatch>> Next() {
    if (i_ == num_stripes_) {
      return nullptr;
    }
    std::shared_ptr<RecordBatch> batch;
    // TODO (https://issues.apache.org/jira/browse/ARROW-14153)
    // pass scan_options_->batch_size
    return reader_->ReadStripe(i_++, included_fields_);
  }

  ...

}

The ORC reader in the Arrow dataset module does not yet honor the ScanOptions batch_size option.

So the returned RecordBatch size may be greater than 65535, which exceeds what the 16-bit selection vector can index.
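
Until the Arrow side passes batch_size through, one possible workaround on the caller side is to re-chunk any oversized batch before it reaches the filter. A minimal sketch, assuming only the public arrow::RecordBatch API (ChunkBatch is a hypothetical helper, not gazelle_plugin code):

#include <algorithm>
#include <memory>
#include <vector>

#include <arrow/record_batch.h>

// Split a stripe-sized batch into zero-copy slices of at most batch_size
// rows, so downstream consumers limited to 65535 rows per batch (such as a
// 16-bit Gandiva selection vector) never see an oversized batch.
std::vector<std::shared_ptr<arrow::RecordBatch>> ChunkBatch(
    const std::shared_ptr<arrow::RecordBatch>& batch, int64_t batch_size) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> chunks;
  for (int64_t offset = 0; offset < batch->num_rows(); offset += batch_size) {
    const int64_t length = std::min(batch_size, batch->num_rows() - offset);
    chunks.push_back(batch->Slice(offset, length));  // zero-copy view
  }
  return chunks;
}

Since Slice only creates views over the original buffers, this adds no copy overhead; the cost is simply more, smaller batches flowing through the pipeline.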

zhixingheyi-tian commented 2 years ago

cc @zhouyuan @zhztheplayer

zhouyuan commented 2 years ago

#556 may help