mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

Partitions with nested directories return zero rows #352

Closed: egorsmth closed this issue 3 months ago

egorsmth commented 3 months ago

parquet4s version 2.18.0

I have 3 Parquet files (part-0000, part-0001, part-0002) inside a directory in an S3 bucket.

- multipart_parquet
   - a1
     - part-0000.snappy.parquet
     - part-0001.snappy.parquet
     - part-0002.snappy.parquet

With the URL s3a://parquet-driver-spec/multipart_parquet/a1, parquet4s reads the files and returns all rows.

But if the files are arranged in this structure:

- multipart_parquet
   - a1
     - b1
       - part-0000.snappy.parquet
     - b2
       - part-0001.snappy.parquet
       - part-0002.snappy.parquet

parquet4s returns 0 rows

code:

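import org.apache.hadoop.conf.Configuration
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

val hadoopConfig = new Configuration() // simplified for issue
hadoopConfig.set("fs.s3a.impl.disable.cache", "true")
hadoopConfig.set("fs.s3a.path.style.access", "true")
// other settings like secrets and url

// querySchema is the projection schema, defined elsewhere
ParquetReader
  .projectedGeneric(querySchema)
  .options(ParquetReader.Options(hadoopConf = hadoopConfig))
  .read(Path("s3a://parquet-driver-spec/multipart_parquet/a1"))

egorsmth commented 3 months ago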

I guess I misunderstood how partitions should be structured. The folder names in my example are incorrect: they should be b=1 and b=2. With that naming, everything works.
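
To make the naming rule concrete, here is a minimal sketch of the `name=value` directory convention described above. `PartitionLayout` and `partitionValues` are hypothetical names invented for illustration; this is not parquet4s code, and the library's own handling of non-conforming directories may differ (in this issue they simply yielded zero rows).

```scala
// Hypothetical helper, not part of parquet4s: it only illustrates the
// `name=value` naming convention for partition directories.
object PartitionLayout {

  /** Reads (column, value) pairs from the directories between `root` and the file.
    * Returns Left with a message if any directory is not named `name=value`.
    */
  def partitionValues(root: String, filePath: String): Either[String, List[(String, String)]] = {
    val relative = filePath.stripPrefix(root).stripPrefix("/")
    val dirs     = relative.split("/").toList.dropRight(1) // drop the file name itself

    dirs.foldRight[Either[String, List[(String, String)]]](Right(Nil)) { (dir, acc) =>
      dir.split("=", 2) match {
        case Array(name, value) if name.nonEmpty && value.nonEmpty =>
          acc.map((name -> value) :: _)
        case _ =>
          Left(s"'$dir' is not a name=value partition directory")
      }
    }
  }
}
```

Under these assumptions, a path such as `.../a1/b=1/part-0000.snappy.parquet` yields a partition column `b` with value `1`, while `.../a1/b1/part-0000.snappy.parquet` is rejected because `b1` carries no `=` separator.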