segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 102 forks source link

Preserve data integrity when merging rows #512

Closed fpetkovski closed 1 year ago

fpetkovski commented 1 year ago

Merging sorted row groups sometimes leads to data corruption, even when merging rows from a single reader. I suspect that values in rows returned by head() are getting overwritten when new rows are read into the buffer. Calling Clone() on the return value of head() fixes the problem: https://github.com/segmentio/parquet-go/blob/510764ae9e8007c4974dd8a1dc42f63f4bbcac4e/merge.go#L232

I have also added a test case which consistently reproduces the bug. I am happy to add in the fix, but I am not entirely whether cloning each row would have performance implications.

fpetkovski commented 1 year ago

Attaching some benchmark results with the fix in place. They actually look awful so we probably want a different solution.

MergeRowGroups/BOOLEAN/groups=1,rows=10000-8    22.6µs ± 1%    43.8µs ± 1%      +93.70%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8    31.7µs ± 2%    53.0µs ± 2%      +67.36%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8    40.4µs ± 0%    63.1µs ± 5%      +56.11%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=1,rows=10000-8      19.1µs ± 0%    40.5µs ± 0%     +111.62%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8      33.8µs ± 2%    55.6µs ± 1%      +64.40%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8      45.4µs ± 2%    67.9µs ± 0%         ~     (p=0.333 n=5+1)

name                                          old row/s      new row/s      delta
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8     44.2M ± 1%     22.8M ± 1%      -48.38%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8     31.6M ± 2%     18.9M ± 2%      -40.25%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8     24.7M ± 0%     15.9M ± 5%      -35.90%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=1,rows=10000-8       52.3M ± 0%     24.7M ± 0%      -52.75%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8       29.6M ± 2%     18.0M ± 1%      -39.18%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8       22.0M ± 2%     14.7M ± 0%         ~     (p=0.333 n=5+1)

name                                          old alloc/op   new alloc/op   delta
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8      497B ± 0%    24497B ± 0%    +4828.97%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8      492B ± 0%    24493B ± 0%    +4878.25%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8      492B ± 0%    24493B ± 0%    +4878.25%  (p=0.016 n=5+4)
MergeRowGroups/INT32/groups=1,rows=10000-8        497B ± 0%    24497B ± 0%    +4828.97%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8        492B ± 0%    24493B ± 0%    +4878.25%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8        492B ± 0%    24493B ± 0%         ~     (p=0.333 n=5+1)

name                                          old allocs/op  new allocs/op  delta
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8      2.00 ± 0%   1002.00 ± 0%   +50000.00%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8      1.00 ± 0%   1001.00 ± 0%  +100000.00%  (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8      1.00 ± 0%   1001.00 ± 0%  +100000.00%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=1,rows=10000-8        2.00 ± 0%   1002.00 ± 0%   +50000.00%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8        1.00 ± 0%   1001.00 ± 0%  +100000.00%  (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8        1.00 ± 0%   1001.00 ± 0%         ~     (p=0.333 n=5+1)
fpetkovski commented 1 year ago

Looks related to what is discussed in https://github.com/segmentio/parquet-go/issues/475.

kevinburkesegment commented 1 year ago

Apologies to make more work for you, but we've decided to move development on this project to a new organization at https://github.com/parquet-go/parquet-go to ensure its long term success. We appreciate your contribution and would appreciate if you could reopen this PR there if it is still relevant.

fpetkovski commented 1 year ago

Will close this PR and reopen a new one in the new repository.