mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

`ParquetWriter.Options` with `ParquetFileWriter.Mode.OVERWRITE` not deleting old parquet files in S3 #311

Closed. jeet23 closed this issue 1 year ago.

jeet23 commented 1 year ago

I am using the rotating writer (`viaParquet`) with generic records and passing `ParquetWriter.Options` with `writeMode = ParquetFileWriter.Mode.OVERWRITE` to overwrite the parquet files in S3 (via the hadoop-aws connector).

I have granted the necessary permissions (`s3:PutObject`, `s3:DeleteObject`) to the IAM role in use, yet the old files are not removed; instead the new file is added alongside them in the parquet directory, i.e. it behaves as if the default `ParquetFileWriter.Mode.CREATE` were being applied.

Are there any gotchas when using `OVERWRITE` mode, or am I doing something incorrectly?

My FS2 pipe for writing the parquet files looks like this:

    // Likely imports for this snippet (assumed; not shown in the original):
    // import com.github.mjakubowski84.parquet4s.ParquetWriter
    // import com.github.mjakubowski84.parquet4s.parquet.viaParquet
    // import org.apache.parquet.hadoop.ParquetFileWriter
    val writePipe = viaParquet[F]
      .generic
      .options(ParquetWriter.Options(hadoopConf = conf, writeMode = ParquetFileWriter.Mode.OVERWRITE))
      .write(hadoopFilePath, messageSchema)

where `messageSchema` denotes the `MessageType` and `hadoopFilePath` is the `Path` in S3.
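
For context, `conf` above is a plain Hadoop `Configuration`. A minimal sketch of what it might contain when targeting S3 through the hadoop-aws connector (these settings are an illustrative assumption, not taken from the issue):

    import org.apache.hadoop.conf.Configuration

    // Sketch: a Hadoop Configuration for writing to S3 via hadoop-aws (assumed setup).
    val conf = new Configuration()
    // Route s3a:// paths to the hadoop-aws filesystem implementation.
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    // Credentials are normally resolved by the default AWS provider chain
    // (e.g. the attached IAM role), so none need to be set here explicitly.

When running on AWS infrastructure, the IAM role attached to the instance or task is picked up automatically, which matches the permissions setup described above.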

[Screenshot attached, 2023-08-14 13:32]

jeet23 commented 1 year ago

Apologies, I figured out that this issue only happens locally (using LocalStack) and works fine against real AWS.
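
For anyone reproducing this locally: when pointing the s3a connector at LocalStack, the endpoint and credentials have to be overridden. A rough sketch (the endpoint and dummy credentials below are standard LocalStack defaults, assumed rather than taken from this thread):

    import org.apache.hadoop.conf.Configuration

    // Sketch: pointing hadoop-aws (s3a) at a local LocalStack endpoint (assumed setup).
    val localConf = new Configuration()
    // LocalStack's default edge endpoint; adjust to your own setup.
    localConf.set("fs.s3a.endpoint", "http://localhost:4566")
    // LocalStack serves buckets path-style rather than as virtual hosts.
    localConf.set("fs.s3a.path.style.access", "true")
    // Dummy credentials; LocalStack accepts any non-empty values by default.
    localConf.set("fs.s3a.access.key", "test")
    localConf.set("fs.s3a.secret.key", "test")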