Open gaborcsardi opened 5 months ago
It would be nice not to introduce new functions, I guess? But is having an append argument in write_parquet() better? I am not sure.
Is there an API that we can use to concatenate multiple files and also do appending?
Appending to a file potentially needs specific parameters, e.g. for matching columns. So maybe a new function that deals with both appending and concatenation is best?
We could have
append_parquet(file, ..., options = parquet_options())
where file is the output file to append to (it might or might not exist), and ... are data frames or Parquet files to append to it.
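To make the proposed signature concrete, here is a naive sketch of its semantics. It is not how appending would really be implemented: it reads everything into memory and rewrites the whole output file instead of adding row groups or pages. The name append_parquet_sketch is hypothetical; read_parquet(), write_parquet() and parquet_options() are the existing nanoparquet functions, and I am assuming all inputs have matching columns.

```r
library(nanoparquet)

# Hypothetical helper, NOT part of nanoparquet. It mimics the proposed
# append_parquet(file, ..., options = parquet_options()) by reading all
# inputs and rewriting the output file from scratch.
append_parquet_sketch <- function(file, ..., options = parquet_options()) {
  # Each input is either a data frame or a path to a Parquet file.
  as_df <- function(x) {
    if (is.data.frame(x)) x else read_parquet(x, options = options)
  }
  parts <- lapply(list(...), as_df)

  # The output file might or might not exist yet.
  if (file.exists(file)) {
    parts <- c(list(read_parquet(file, options = options)), parts)
  }

  # Assumption: all parts have the same columns; rbind() errors otherwise.
  combined <- do.call(rbind, parts)
  write_parquet(combined, file, options = options)
  invisible(file)
}
```

For example, append_parquet_sketch("out.parquet", df1, "other.parquet") would create or extend out.parquet with the rows of df1 and other.parquet, in that order.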
As for the row groups and pages to create, we can do something like
Otherwise create a new row group or a new page.
Some types will be difficult to merge, e.g. how do we merge factor levels for factor columns? Similarly, what do we do with ENUM columns in Parquet files? Merge dictionaries if we are merging pages?
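One possible answer for factors, sketched in base R only: take the union of the levels, keeping the order of the first factor, and re-encode both sides against it. The helper name merge_factors is made up for illustration, and this says nothing about how the Parquet dictionaries would be merged on disk.

```r
# Hypothetical helper: combine two factors so that the result has the
# union of both level sets. Levels of `a` come first, then any new
# levels from `b`, in the order they appear in levels(b).
merge_factors <- function(a, b) {
  stopifnot(is.factor(a), is.factor(b))
  lv <- union(levels(a), levels(b))
  factor(c(as.character(a), as.character(b)), levels = lv)
}

old <- factor(c("a", "b"))
new <- factor(c("c", "b"), levels = c("c", "b"))
merged <- merge_factors(old, new)
```

The downside is that existing integer codes are only stable for the first input; values from later inputs may need re-encoding, which matters if we want to reuse already written dictionary pages.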
We can start with something simple:
Or more generally, concatenate Parquet files and data frames. Should be pretty simple to implement, if we can have a reasonable API.