Describe the bug
I ran the build and test commands for gazelle_plugin 1.2 and got some errors.
Code versions as below:
1) arrow-4.0.0-oap-1.2.0-release.zip
2) gazelle_plugin-1.2.0-release.zip
Errors as below:
RUN ABORTED
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(Lscala/collection/immutable/Map;Lscala/collection/Seq;Lorg/apache/spark/sql/SparkSession;)Lscala/Option;
at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:170)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:208)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:205)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:418)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
As you can see, the error points to the file /arrow-data-source/parquet/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala.
I found that the mergeSchemasInParallel method in this file takes 3 arguments, but it takes 4 arguments in vanilla Spark 3.1.
I modified ParquetFileFormat.scala as below, and then it runs past this error:
diff ParquetFileFormat.scala ParquetFileFormat.scala.old
439d438
< parameters: Map[String, String],
455c454
< SchemaMergeUtils.mergeSchemasInParallel(sparkSession, parameters, filesToTouch, reader)
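For reference, here is a minimal Scala sketch of the shape that change produces, assuming the plugin's copy should mirror vanilla Spark 3.1's ParquetFileFormat; the object name PatchedMergeSchemasSketch and the empty schemaReader closure are my placeholders, not the plugin's real code. The point is only that the method accepts the parameters map that ParquetUtils.inferSchema passes in, and forwards it as the extra fourth argument to SchemaMergeUtils.mergeSchemasInParallel:

```scala
// Sketch only: mirrors the signature vanilla Spark 3.1's ParquetUtils expects and the
// 4-argument SchemaMergeUtils call from the diff above; the real schema-reading closure
// (which parses parquet footers) is replaced by a placeholder here.
package org.apache.spark.sql.execution.datasources.parquet

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.SchemaMergeUtils
import org.apache.spark.sql.types.StructType

object PatchedMergeSchemasSketch {
  def mergeSchemasInParallel(
      parameters: Map[String, String],   // argument added by the diff; matches the caller in ParquetUtils.inferSchema
      filesToTouch: Seq[FileStatus],
      sparkSession: SparkSession): Option[StructType] = {
    // Placeholder reader: the plugin's real code reads parquet footers into schemas here.
    val reader = (files: Seq[FileStatus], conf: Configuration, ignoreCorruptFiles: Boolean) =>
      Seq.empty[StructType]
    // Spark 3.1's SchemaMergeUtils.mergeSchemasInParallel takes 4 arguments, including `parameters`.
    SchemaMergeUtils.mergeSchemasInParallel(sparkSession, parameters, filesToTouch, reader)
  }
}
```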
However, I then got some other errors, shown below. Please check whether these errors are normal (an illustrative sketch of the kind of aggregates behind the two fallback messages follows the log excerpt); I can send you a detailed log later.
Fall back to use row-based operators, error is last(value#22596)() is not supported in ColumnarAggregation, original sparkplan is class org.apache.spark.sql.execution.aggregate.HashAggregateExec(List(class org.apache.spark.sql.execution.streaming.StateStoreSaveExec))
Fall back to use row-based operators, error is variance(cast(a#51859 as double)) is not supported in ColumnarAggregation, original sparkplan is class org.apache.spark.sql.execution.aggregate.HashAggregateExec(List(class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec))
20:59:34.377 WARN org.apache.spark.sql.execution.datasources.v2.arrow.SparkMemoryUtils: Detected leaked memory pool, size: 127976...
20:59:34.442 WARN org.apache.spark.sql.execution.datasources.v2.arrow.SparkMemoryUtils: Detected leaked memory pool, size: 127976...
20:57:03.655 ERROR org.apache.spark.sql.execution.streaming.MicroBatchExecution: Query [id = a188d962-aa09-487f-a6e6-1b2beacf0583, runId = 31cd986e-0306-451b-95db-115ca3800057] terminated with error
org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:388)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:336)
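For context on the two "Fall back to use row-based operators" messages above, here is an illustrative batch-mode sketch of my own (not taken from the plugin's test suites; the first message actually comes from a streaming aggregation) showing aggregates of that shape, last() and variance(), which the log reports as unsupported in ColumnarAggregation so the plan stays on the row-based HashAggregateExec:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{last, variance}

object FallbackAggregates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-fallback-demo")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10.0), (1, 20.0), (2, 30.0)).toDF("a", "value")

    // last(value) is reported as unsupported in ColumnarAggregation,
    // so this aggregation is planned with the row-based HashAggregateExec.
    df.groupBy($"a").agg(last($"value")).show()

    // variance(cast(a as double)) likewise triggers the row-based fallback.
    df.agg(variance($"a".cast("double"))).show()

    spark.stop()
  }
}
```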
To Reproduce
Build and run the tests for gazelle_plugin 1.2.
Build cmd: mvn clean package -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=OFF -Darrow_root=/opt/build/arrow_install -Dcheckstyle.skip
Test cmd: mvn test -Dbuild_arrow=OFF -Darrow_root=/opt/build/arrow_install -Dcheckstyle.skip
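If a narrower log is useful, a single suite can be re-run on its own, assuming the module uses the scalatest-maven-plugin (the suite name below is a placeholder, not a real suite from the repo):
mvn test -Dbuild_arrow=OFF -Darrow_root=/opt/build/arrow_install -Dcheckstyle.skip -DwildcardSuites=org.apache.spark.sql.SomeFailingSuite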
Expected behavior
The tests run successfully with no errors.
Additional context
Add any other context about the problem here.