segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

fix for reading less than expected number of rows (#469) #491

Closed ryandeivert closed 1 year ago

ryandeivert commented 1 year ago

resolves: https://github.com/segmentio/parquet-go/issues/469
related to: https://github.com/segmentio/parquet-go/issues/471

Background

The Read convenience method does not paginate across row reads. It returns only the first group of rows read into the buffer, so callers can receive fewer rows than they expected.

Changes

Implement simple "pagination" over row reads so that the buffer is filled with the full expected set of rows.
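For readers unfamiliar with the pattern, here is a minimal, standard-library-only sketch of the kind of read loop described above. It is not the code from this PR and does not use the parquet-go API: `rowReader`, `chunkedReader`, `readFull`, and the 1024-row chunk size are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// rowReader is a hypothetical stand-in for a row source whose Read may
// return fewer rows than the buffer can hold. It is not the parquet-go API.
type rowReader interface {
	Read(rows []int64) (int, error)
}

// chunkedReader simulates a file of `total` rows that yields at most
// `chunk` rows per Read call (both values are made up for this sketch).
type chunkedReader struct {
	total, chunk, pos int
}

func (r *chunkedReader) Read(rows []int64) (int, error) {
	if r.pos >= r.total {
		return 0, io.EOF
	}
	n := len(rows)
	if n > r.chunk {
		n = r.chunk
	}
	if rest := r.total - r.pos; n > rest {
		n = rest
	}
	for i := 0; i < n; i++ {
		rows[i] = int64(r.pos + i)
	}
	r.pos += n
	return n, nil
}

// readFull keeps calling Read until rows is full or the source is exhausted.
// It returns the number of rows written into rows.
func readFull(r rowReader, rows []int64) (int, error) {
	filled := 0
	for filled < len(rows) {
		n, err := r.Read(rows[filled:])
		filled += n
		if errors.Is(err, io.EOF) {
			return filled, nil // source exhausted: a short result is expected
		}
		if err != nil {
			return filled, err
		}
		if n == 0 {
			return filled, io.ErrNoProgress // guard against spinning forever
		}
	}
	return filled, nil
}

func main() {
	buf := make([]int64, 8000)

	// A single Read call returns only the first chunk.
	n, _ := (&chunkedReader{total: 8000, chunk: 1024}).Read(buf)
	fmt.Println("single Read:", n) // single Read: 1024

	// Looping over Read fills the whole buffer.
	n, err := readFull(&chunkedReader{total: 8000, chunk: 1024}, buf)
	fmt.Println("readFull:", n, err) // readFull: 8000 <nil>
}
```

The sketch treats `io.EOF` as a normal end of input and returns whatever was read so far, so a buffer larger than the file yields a short count rather than an error.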

Testing

Add a unit test and a fixture file containing 8000 rows to cover this case.

grantwwu commented 1 year ago

Does the test fail without the fix? In my testing, I've seen reader.Read read up to 2^15 rows at once, but that was with rows containing a single int64 each (which were probably being stored in a single byte each due to delta encoding).

bartleyg commented 1 year ago

@ryandeivert thanks for the contribution! Sorry, this was already being worked on at a lower level to fix some related issues. It should be fixed by https://github.com/segmentio/parquet-go/pull/489, so pull main and try it out.

ryandeivert commented 1 year ago

@bartleyg great, thanks!