segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

parquet: add option to reset SortingWriter with a different sortRowCount #477

Closed: asubiotto closed this 1 year ago

asubiotto commented 1 year ago

The motivation is to let callers who know how many rows they are going to write reuse pooled SortingWriters.

This is a proposal for now; I have yet to integrate this into our (Polar Signals') usage, but I foresee us needing something like this. Happy to discuss.
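
For readers following along, the pooling pattern described above might look roughly like the sketch below. It uses only the standard library; `pooledSorter`, its `Reset(sortRowCount)` method, `acquire`, and `release` are hypothetical names for illustration and are not the parquet-go API or the change proposed in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// pooledSorter is a hypothetical stand-in for a sorting writer.
// It is NOT the parquet-go SortingWriter type.
type pooledSorter struct {
	sortRowCount int     // how many rows to buffer before sorting
	rows         []int64 // buffered rows (simplified to plain integers)
}

// Reset drops any buffered rows and applies a new sortRowCount, while
// keeping the already allocated backing array for reuse.
func (s *pooledSorter) Reset(sortRowCount int) {
	s.sortRowCount = sortRowCount
	s.rows = s.rows[:0]
}

// sorterPool hands out reusable sorters and creates one when it is empty.
var sorterPool = sync.Pool{
	New: func() any { return &pooledSorter{} },
}

// acquire takes a sorter from the pool and sizes it for this caller.
func acquire(sortRowCount int) *pooledSorter {
	s := sorterPool.Get().(*pooledSorter)
	s.Reset(sortRowCount)
	return s
}

// release returns the sorter to the pool for later reuse.
func release(s *pooledSorter) {
	sorterPool.Put(s)
}

func main() {
	s := acquire(1000)
	s.rows = append(s.rows, 3, 1, 2)
	release(s)

	// A later caller with a different row count gets a clean sorter.
	s = acquire(50)
	fmt.Println(s.sortRowCount, len(s.rows))
	release(s)
}
```

Resetting on acquire rather than on release means a pooled value can never leak a previous caller's row count or buffered rows, whichever object the pool happens to return.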

asubiotto commented 1 year ago

Another option is to just use math.MaxInt64 as the sortRowCount and call Flush when we're done.

asubiotto commented 1 year ago

We're probably not going to use the SortingWriter, given that it's a 30% performance hit compared with just writing to a buffer, sorting, and copying to a writer, so I'm happy to close this PR.
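
As a rough illustration of the "write to a buffer, sort, copy to a writer" approach mentioned above, here is a minimal standard-library sketch. The `row` type, the `rowSink` interface, `memorySink`, and `sortAndCopy` are all hypothetical names chosen for this example; a real implementation would use the parquet-go buffer and writer types instead, and this sketch makes no claim about their performance.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical record type used only for this illustration.
type row struct {
	Timestamp int64
	Value     string
}

// rowSink is a hypothetical stand-in for whatever writer receives the
// sorted rows.
type rowSink interface {
	WriteRows(rows []row) (int, error)
}

// memorySink collects rows in memory so the example can run on its own.
type memorySink struct {
	rows []row
}

// WriteRows appends the rows to the sink and reports how many were written.
func (s *memorySink) WriteRows(rows []row) (int, error) {
	s.rows = append(s.rows, rows...)
	return len(rows), nil
}

// sortAndCopy sorts the buffered rows in place by Timestamp and then
// copies them to dst in a single call.
func sortAndCopy(dst rowSink, buf []row) (int, error) {
	sort.SliceStable(buf, func(i, j int) bool {
		return buf[i].Timestamp < buf[j].Timestamp
	})
	return dst.WriteRows(buf)
}

func main() {
	// All rows are buffered in memory before sorting.
	buf := []row{{3, "c"}, {1, "a"}, {2, "b"}}
	sink := &memorySink{}

	n, err := sortAndCopy(sink, buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(n, sink.rows)
}
```

This only works when every row fits in memory at once, which is the trade-off against a writer that sorts in bounded chunks.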

kevinburkesegment commented 1 year ago

Maybe we need to document that the SortingWriter is meant for cases where the working set does not fit in memory? (This is its intended use case.)

kevinburkesegment commented 1 year ago

Achille also recommends using the RowBuffer (instead of just a buffer), but maybe you are already doing that.

asubiotto commented 1 year ago

Thanks! Actually we are not using the RowBuffer. Will try it out.