segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

add parquet.SortingWriter #427

Closed achille-roussel closed 1 year ago

achille-roussel commented 1 year ago

This PR adds a new writer type named parquet.SortingWriter which ensures that rows written to row groups are always ordered according to the sorting columns passed as configuration.

The sorting strategy buffers rows in memory, sorts the buffer, then serializes it as a row group. When the writer is flushed or closed, all the row groups are combined with a k-way merge, which preserves the global order of rows.