r-lib / nanoparquet

R package to read and write Parquet files
https://nanoparquet.r-lib.org/

Append to a parquet file #56

Open gaborcsardi opened 5 months ago

gaborcsardi commented 5 months ago

Or, more generally, concatenate Parquet files and data frames. This should be pretty simple to implement if we can agree on a reasonable API.

gaborcsardi commented 5 months ago

It would be nice not to introduce new functions, I guess? But is having an append argument in write_parquet() better? I am not sure.

Is there an API that we can use to concatenate multiple files and also do appending?

Appending to a file potentially needs specific parameters, e.g. for matching columns. So maybe a new function that deals with both appending and concatenation is best?
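To make the column-matching concern concrete, here is a minimal base R sketch of one possible rule: match by name, reorder the new data to the existing column order, and fail on any mismatch. `match_columns()` is a hypothetical helper for illustration only, not a nanoparquet function, and a real implementation would probably also need to compare column types.

```r
# Hypothetical helper, not part of nanoparquet: align the columns of `new`
# with those of `existing` by name, or fail if the column sets differ.
match_columns <- function(existing, new) {
  absent <- setdiff(names(existing), names(new))
  extra <- setdiff(names(new), names(existing))
  if (length(absent) > 0 || length(extra) > 0) {
    stop(
      "Column mismatch. Missing: ", paste(absent, collapse = ", "),
      "; unexpected: ", paste(extra, collapse = ", ")
    )
  }
  # Reorder the columns of `new` to the order used by `existing`.
  new[names(existing)]
}

existing <- data.frame(id = 1:2, name = c("a", "b"))
new <- data.frame(name = "c", id = 3L)
aligned <- match_columns(existing, new)
names(aligned)
```

Whether to match by name or by position, and whether missing columns should be an error or filled with NA, is exactly the kind of parameter an append function might have to expose.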

gaborcsardi commented 5 months ago

We could have

append_parquet(file, ..., options = parquet_options())

where file is the output file to append to (it may or may not exist), and ... are the data frames or Parquet files to append to it.
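For comparison, the semantics of the proposed signature can be emulated today by reading everything into memory and rewriting the file. The sketch below assumes the nanoparquet package is installed and uses only `read_parquet()` and `write_parquet()`; `naive_append_parquet()` is a hypothetical name, and it omits the `options` argument. It rewrites the whole output file, so it only illustrates the intended behavior, not the row-group-level append discussed here.

```r
# Hypothetical helper, not part of nanoparquet: emulate appending by reading
# the existing file, binding the new rows, and rewriting the whole file.
naive_append_parquet <- function(file, ...) {
  inputs <- list(...)
  stopifnot(length(inputs) > 0)
  # Each input is either a data frame or a path to a Parquet file.
  pieces <- lapply(inputs, function(x) {
    if (is.data.frame(x)) {
      as.data.frame(x)
    } else {
      as.data.frame(nanoparquet::read_parquet(x))
    }
  })
  # The output file might not exist yet; if it does, its rows come first.
  if (file.exists(file)) {
    pieces <- c(list(as.data.frame(nanoparquet::read_parquet(file))), pieces)
  }
  # rbind() on data frames matches columns by name.
  combined <- do.call(rbind, pieces)
  nanoparquet::write_parquet(combined, file)
  invisible(file)
}

out <- tempfile(fileext = ".parquet")
naive_append_parquet(out, data.frame(x = 1:2))  # creates the file
naive_append_parquet(out, data.frame(x = 3:4))  # appends two more rows
appended <- as.data.frame(nanoparquet::read_parquet(out))
```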

As for the row groups and pages to create, we can do something like

Otherwise create a new row group or a new page.

Some types will be difficult to merge, e.g. how do we merge factor levels for factor columns? Similarly, what do we do with ENUM columns in Parquet files? Merge dictionaries if we are merging pages?
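One possible answer to the factor question, sketched in base R: take the union of the levels, keeping the existing levels first so that integer codes already written stay valid. `merge_factors()` is a hypothetical helper, and this says nothing about how Parquet dictionary pages or ENUM columns would actually be handled.

```r
# Hypothetical helper: combine two factors using the union of their levels.
merge_factors <- function(a, b) {
  stopifnot(is.factor(a), is.factor(b))
  # Levels of `a` keep their positions; levels only in `b` are appended.
  lv <- union(levels(a), levels(b))
  factor(c(as.character(a), as.character(b)), levels = lv)
}

old <- factor(c("low", "high"), levels = c("low", "high"))
new <- factor(c("mid", "low"), levels = c("low", "mid"))
merged <- merge_factors(old, new)
levels(merged)      # "low" "high" "mid"
as.integer(merged)  # 1 2 3 1
```

Because the levels of the first factor are never reordered, the codes of the existing values are unchanged; only the new data needs to be recoded against the merged level set.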

We can start with something simple: