mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License
283 stars 66 forks

partitioning incompatibility with spark #353

Closed dpogibelskiy closed 1 week ago

dpogibelskiy commented 1 month ago

Hi,

I have run into an incompatibility in how Parquet4s (2.18.0) and Spark interpret a folder of partitioned Parquet files. If the root folder contains partition subfolders of the form key=value and the value is an escaped string, then Parquet4s ParquetReader ignores that subfolder.

Steps to reproduce:

Using Spark 3.5.1 with all default options and MinIO as S3 storage, I prepare a test dataset:

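    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import scala.collection.JavaConverters._ // provides .asJava

    // `spark` is an existing SparkSession.
    // Three rows whose partition values are null, a plain string, and a string containing '='.
    val rows = List(
      Row(null, "data1"),
      Row("a", "data2"),
      Row("a=2", "data3")
    )
    val df = spark.createDataFrame(
      rows.asJava,
      StructType(
        Array(StructField("a", StringType), StructField("b", StringType))
      )
    )
    // Partition by column "a"; Spark escapes special characters in the directory names.
    df.write.partitionBy("a").parquet("testpath")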

This creates folder structure:

testpath
 |- a=a
 |- a=__HIVE_DEFAULT_PARTITION__
 |- a=a%3D2
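
For readers unfamiliar with the naming scheme, here is a minimal sketch of how a partition value can end up as a directory name such as `a=a%3D2`. It is not Spark's code: the object `PartitionDirName`, its methods, and the `Special` character set are hypothetical, and the set is only an illustrative subset of what Spark/Hive actually escape.

```scala
object PartitionDirName {
  // Assumption: an illustrative subset of the characters that get percent-escaped.
  private val Special: Set[Char] = Set('=', '/', '%', ':', '?', '#', '\\')

  // Marker used for null partition values, as seen in the listing above.
  private val NullMarker = "__HIVE_DEFAULT_PARTITION__"

  /** Replaces each special or control character with %XX (uppercase hex). */
  def escape(value: String): String =
    value.flatMap { c =>
      if (Special(c) || c < ' ') f"%%${c.toInt}%02X" else c.toString
    }

  /** Builds the `column=value` directory name for one partition value. */
  def dirName(column: String, value: Option[String]): String =
    s"$column=${value.map(escape).getOrElse(NullMarker)}"
}
```

With this sketch, `PartitionDirName.dirName("a", Some("a=2"))` yields `a=a%3D2`, `dirName("a", Some("a"))` yields `a=a`, and `dirName("a", None)` yields `a=__HIVE_DEFAULT_PARTITION__`.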

I can read back all 3 records using Spark without changing any options:

+-----+----+
|    b|   a|
+-----+----+
|data3| a=2|
|data1|NULL|
|data2|   a|
+-----+----+

Parquet4s ParquetReader.generic.options(opt).read gives only 2 records:

 b=BinaryValue(Binary{5 constant bytes, [100, 97, 116, 97, 49]})  a=BinaryValue(Binary{"__HIVE_DEFAULT_PARTITION__"}) 
 b=BinaryValue(Binary{5 constant bytes, [100, 97, 116, 97, 50]})  a=BinaryValue(Binary{"a"})

mjakubowski84 commented 1 month ago

Hi,

Yes, URL encoding of partition values is something that Parquet4s misses. The partition is not read simply because its directory name doesn't match the defined regex. The regex is based on the advice here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html However, even the aforementioned page mentions URL encoding. So, that's a bug!
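
To illustrate the idea, here is a rough sketch of a partition-segment parser that accepts percent-encoded values and decodes them. It is not the actual Parquet4s code or the fix itself: the object `PartitionPath`, its `Segment` regex, and both methods are hypothetical, and the decoder assumes each `%XX` escape stands for a single character.

```scala
object PartitionPath {
  // Hypothetical pattern: a key, '=', then a non-empty value without path separators.
  // Allowing '%' in the value is what lets names like a=a%3D2 match.
  private val Segment = """([a-zA-Z0-9._]+)=([^/]+)""".r

  /** Decodes %XX escapes; a '%' not followed by two hex digits is kept as-is. */
  def unescape(value: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < value.length) {
      val c = value.charAt(i)
      val hex = if (c == '%' && i + 2 < value.length) value.substring(i + 1, i + 3) else ""
      if (hex.length == 2 && hex.forall(ch => Character.digit(ch, 16) >= 0)) {
        sb.append(Integer.parseInt(hex, 16).toChar)
        i += 3
      } else {
        sb.append(c)
        i += 1
      }
    }
    sb.toString
  }

  /** Returns (key, decoded value) if the segment looks like a partition directory. */
  def parse(segment: String): Option[(String, String)] = segment match {
    case Segment(key, rawValue) => Some(key -> unescape(rawValue))
    case _                      => None
  }
}
```

Here `PartitionPath.parse("a=a%3D2")` gives `Some(("a", "a=2"))`, while a segment without `=` gives `None`. A hand-written decoder is used instead of `java.net.URLDecoder` because the latter also turns `+` into a space, which would alter values that legitimately contain `+`.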

mjakubowski84 commented 1 month ago

To be fixed by #355

mjakubowski84 commented 1 week ago

Fix released in https://github.com/mjakubowski84/parquet4s/releases/tag/v2.19.0