pascalwhoop opened this issue 1 month ago
It does seem nuts to me that it silently overwrites files: https://github.com/pola-rs/polars/issues/18242.

(Tested on versions 1.7.1 and 1.9.0.) I think this may already be solved, because I was able to append just fine, as long as the write targets a new partition; I haven't tested the behaviour for existing partitions.
Example code, modified from the linked issue:
```python
import polars as pl

df_a = pl.DataFrame(
    {
        'type': ['a', 'b'],
        'date': ['2024-08-15', '2024-08-16'],
        'value': [68, 70],
    }
)
df_a.write_parquet('./example_part.parquet', partition_by='date')

df_b = pl.DataFrame(
    {
        'type': ['a', 'b'],
        'date': ['2024-08-17', '2024-08-18'],
        'value': [72, 74],
    }
)
df_b.write_parquet('./example_part.parquet', partition_by='date')

pl.read_parquet('./example_part.parquet')
```
returns:
| type | date | value |
| --- | --- | --- |
| str | str | i64 |
| "a" | "2024-08-15" | 68 |
| "b" | "2024-08-16" | 70 |
| "a" | "2024-08-17" | 72 |
| "b" | "2024-08-18" | 74 |
Description
Hey. Spark has `mode=append` for writing parquet files. It's quite useful: it just adds more partitions to the folder of an existing dataset, which is great for writing in batches across multiple runs.

How would you solve this in Polars? I know that appending data to an existing parquet *file* is a whole different game, but just adding more files to the dataset should be fairly OK, no? I suspect that simply not overwriting / deleting the whole existing folder structure would do the trick.
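For reference, the Spark behaviour I'm comparing to looks roughly like this (a PySpark sketch; the path is just an example):

```python
# PySpark: append mode adds new files/partitions to an existing dataset
# instead of replacing it.
(
    df.write
      .partitionBy("date")
      .mode("append")
      .parquet("./example_part.parquet")
)
```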
Edit
Digging into this, I realize there is already a way with partitioned data, as long as the partition we write to is unique / always new (e.g. by generating a `run_id` column). Polars' partitioned write goes through pyarrow, and pyarrow's default `existing_data_behavior` is `overwrite_or_ignore`, so it should just add more files and ignore the existing ones. Exactly what I was looking for. I'll whip up a quick example; a sketch of the `run_id` approach follows.
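A minimal sketch of that `run_id` approach, assuming the `df_a` / `df_b` frames from the example above (the `append_batch` helper and the `./example_append.parquet` path are illustrative names, not anything from Polars itself):

```python
import uuid

import polars as pl

# Illustrative helper: write each batch under a fresh, unique partition value,
# so existing partition directories are never touched and files only get added.
def append_batch(df: pl.DataFrame, path: str) -> None:
    run_id = uuid.uuid4().hex  # unique per write -> always a new partition
    df.with_columns(pl.lit(run_id).alias("run_id")).write_parquet(
        path, partition_by=["run_id", "date"]
    )

append_batch(df_a, "./example_append.parquet")
append_batch(df_b, "./example_append.parquet")

# Reading the directory back returns the rows from both batches,
# with run_id and date recovered as partition columns.
print(pl.read_parquet("./example_append.parquet"))
```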