Closed fpetkovski closed 1 year ago
Attaching some benchmark results with the fix in place. They actually look awful so we probably want a different solution.
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8 22.6µs ± 1% 43.8µs ± 1% +93.70% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8 31.7µs ± 2% 53.0µs ± 2% +67.36% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8 40.4µs ± 0% 63.1µs ± 5% +56.11% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=1,rows=10000-8 19.1µs ± 0% 40.5µs ± 0% +111.62% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8 33.8µs ± 2% 55.6µs ± 1% +64.40% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8 45.4µs ± 2% 67.9µs ± 0% ~ (p=0.333 n=5+1)
name old row/s new row/s delta
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8 44.2M ± 1% 22.8M ± 1% -48.38% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8 31.6M ± 2% 18.9M ± 2% -40.25% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8 24.7M ± 0% 15.9M ± 5% -35.90% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=1,rows=10000-8 52.3M ± 0% 24.7M ± 0% -52.75% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8 29.6M ± 2% 18.0M ± 1% -39.18% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8 22.0M ± 2% 14.7M ± 0% ~ (p=0.333 n=5+1)
name old alloc/op new alloc/op delta
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8 497B ± 0% 24497B ± 0% +4828.97% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8 492B ± 0% 24493B ± 0% +4878.25% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8 492B ± 0% 24493B ± 0% +4878.25% (p=0.016 n=5+4)
MergeRowGroups/INT32/groups=1,rows=10000-8 497B ± 0% 24497B ± 0% +4828.97% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8 492B ± 0% 24493B ± 0% +4878.25% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8 492B ± 0% 24493B ± 0% ~ (p=0.333 n=5+1)
name old allocs/op new allocs/op delta
MergeRowGroups/BOOLEAN/groups=1,rows=10000-8 2.00 ± 0% 1002.00 ± 0% +50000.00% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=2,rows=20000-8 1.00 ± 0% 1001.00 ± 0% +100000.00% (p=0.008 n=5+5)
MergeRowGroups/BOOLEAN/groups=3,rows=30000-8 1.00 ± 0% 1001.00 ± 0% +100000.00% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=1,rows=10000-8 2.00 ± 0% 1002.00 ± 0% +50000.00% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=2,rows=20000-8 1.00 ± 0% 1001.00 ± 0% +100000.00% (p=0.008 n=5+5)
MergeRowGroups/INT32/groups=3,rows=30000-8 1.00 ± 0% 1001.00 ± 0% ~ (p=0.333 n=5+1)
Looks related to what is discussed in https://github.com/segmentio/parquet-go/issues/475.
Apologies to make more work for you, but we've decided to move development on this project to a new organization at https://github.com/parquet-go/parquet-go to ensure its long term success. We appreciate your contribution and would appreciate if you could reopen this PR there if it is still relevant.
Will close this PR and reopen a new one in the new repository.
Merging sorted row groups sometimes leads to data corruption, even when merging rows from a single reader. I suspect that values in rows returned by
head()
are getting overwritten when new rows are read into the buffer. CallingClone()
on the return value ofhead()
fixes the problem: https://github.com/segmentio/parquet-go/blob/510764ae9e8007c4974dd8a1dc42f63f4bbcac4e/merge.go#L232I have also added a test case which consistently reproduces the bug. I am happy to add in the fix, but I am not entirely whether cloning each row would have performance implications.