mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

Compatibility of parquet-hadoop with Spark 3.1 #325

Closed. conderls closed this issue 9 months ago.

conderls commented 12 months ago

Since parquet4s uses parquet-hadoop parquetVersion = "1.13.1" while Spark 3.1 (spark-sql) ships with version 1.10.1, there are conflicts that lead to this error:

java.lang.NoSuchMethodError: org.apache.parquet.hadoop.util.HadoopInputFile.getPath()Lorg/apache/hadoop/fs/Path;
  at com.github.mjakubowski84.parquet4s.ParquetReader$BuilderImpl.read(ParquetReader.scala:85)
  at com.github.mjakubowski84.parquet4s.ParquetReader$BuilderImpl.read(ParquetReader.scala:77)

The HadoopInputFile.getPath() method was added in version 1.12.x, and it does nothing but return the path from the wrapped FileStatus:

  // from org.apache.parquet.hadoop.util.HadoopInputFile (parquet-hadoop 1.12+)
  import org.apache.hadoop.fs.Path;

  public Path getPath() {
    return stat.getPath();
  }
  ...
  public String toString() {
    return stat.getPath().toString();
  }

So, is it possible to improve compatibility by getting rid of the call to HadoopInputFile.getPath(), and using new Path(HadoopInputFile.toString) instead?
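A minimal sketch of that substitution (illustrative only; pathOf is a hypothetical helper, and it assumes toString already returns the path string in 1.10.x, as the snippet above suggests):

  import org.apache.hadoop.fs.Path
  import org.apache.parquet.hadoop.util.HadoopInputFile

  // Sketch: derive the Path without the 1.12+ getPath() accessor.
  // toString delegates to stat.getPath().toString(), so the value is the same.
  def pathOf(inputFile: HadoopInputFile): Path =
    new Path(inputFile.toString)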

mjakubowski84 commented 12 months ago

Hi @conderls! I would prefer not to add such complexity to the library.

Considering that parquet-hadoop in version 1.10.1 is five years old, I recommend you upgrade it. You may not even need to upgrade Spark. Otherwise, you still have the option of shading.
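For instance, a minimal build.sbt override could look like this (a sketch; the version to pin is up to you):

  // build.sbt sketch: force a newer parquet-hadoop in the application's dependency graph
  dependencyOverrides += "org.apache.parquet" % "parquet-hadoop" % "1.13.1"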

conderls commented 12 months ago

> Hi @conderls! I would prefer not to add such complexity to the library.
>
> Considering that parquet-hadoop in version 1.10.1 is five years old, I recommend you upgrade it. You may not even need to upgrade Spark. Otherwise, you still have the option of shading.

  1. I agree with keeping pace with the latest version, but the getPath() method added in v1.12.x breaks compatibility, even though it does nothing more than return the path.
  2. Since spark-sql 3.1.2 depends on parquet-hadoop in provided scope, I may not be able to upgrade parquet-hadoop without upgrading Spark.
  3. spark-sql depends on parquet-hadoop 1.10.1 while parquet4s depends on 1.13.1, so the shading option may not work here?

mjakubowski84 commented 12 months ago

2) Check out https://hadoopsters.com/how-to-override-a-spark-dependency-in-client-or-cluster-mode-2860a64ad1d5
3) In this option, you should shade parquet4s' dependency on parquet-hadoop.
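A hedged sbt-assembly sketch of option 3 (the rename pattern and version are illustrative, not parquet4s' documented setup):

  // build.sbt sketch: relocate all parquet classes bundled in the fat jar,
  // so parquet4s uses its own 1.13.1 while Spark keeps its provided 1.10.1
  assembly / assemblyShadeRules := Seq(
    ShadeRule.rename("org.apache.parquet.**" -> "shadedparquet.@1").inAll
  )

Since spark-sql is in provided scope, Spark's classes stay outside the fat jar and keep resolving the unshaded org.apache.parquet 1.10.1, while the relocated 1.13.1 classes travel with the application.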