segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

search: add comments to search.go #388

Closed abraithwaite closed 2 years ago

abraithwaite commented 2 years ago

I was going through this function to check my understanding of page indexes while thinking about granules in TraceDB.

I believe that granules in other databases like clickhouse and frostdb are the equivlant of column pages in parquet. The page index is the sparse index over the pages (granules) wherein the data being searched for is contained.

This check on the binary search mostly confirms that suspicion, if I'm not mistaken. Please correct me if I'm misunderstaneding something.

Also, as part of this, I found what is possibly an optimization to avoid skipping ahead a page, by comparing the current value to the max in the current page, instead of letting the binary search check the next page, then falling through to the final if-block.

It may or may not be a bug also, if, for example, the following happens:

you have pages:

[1,2,3]
[3,4,5]
[6,7,8]

Searching for 3 in this case would not return the first page and instead return the second, missing values of 3 in the first page.

I'll have to look at writing a test for this, but the current search test will pass creating this scenario without altering the test somewhat (since it's only looking for presence, not the expected index)

abraithwaite commented 2 years ago

If I'm understanding correctly the current version of the algorithm returns the index of the last page with min value greater or equal to the value being searched.

That's not my understanding. My understanding is that the current algorithm returns early if it finds a page with minValue == searchValue, which could be the first page, last page, or a page in the middle in the case where an ID spans 3 or more pages. It all depends on where it exists in the binary search.

I'm curious whether an alternative approach could be to do a linear search backwards from that index after the binary search? It might keep the code simpler and avoid doing two calls to the comparison function per iteration?

This is what I speculated in the review comments. I feel while it may be a good minor optimization, I believe it would complicate the code somewhat and in the worst case (a table full of the same ID) take O(N/2) time to scan to the beginning of the ColumnChunk.

Edit: since linear search will have a O(N) worst case we could also do a second binary search on [0:index] looking for the first page where the max value is greater or equal to the value we are looking for?

How is this different from what's implemented in this PR?

achille-roussel commented 2 years ago

My understanding is that the current algorithm returns early if it finds a page with minValue == searchValue

OK, my mistake.

How is this different from what's implemented in this PR?

The difference would be that we would only do one comparison per iteration: first loop compares min values, second compares max values.

The change you are proposing is good, I just want to explore whether there are opportunities for simplification 👍

abraithwaite commented 2 years ago

The difference would be that we would only do one comparison per iteration: first loop compares min values, second compares max values.

Gotcha. I did think about this given the existing code was only doing one bounds check (on index.MinValue) a bit and couldn't figure out a way to handle this kind of case:

search: 5
[1,2,3]
[4,5,6]
[7,8,9]

Only checking Min on the first loop would skip to page [7,8,9] and would return a bad result (not in columnChunk), if I'm not mistaken.

Edit: I am mistaken, more to come.

abraithwaite commented 2 years ago

Took the feedback regarding multiple checks on every iteration and implemented it, while still not regressing on the bug found previously (#389).