Wrong results for a primary index with composite keys

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
I hit this issue by inspecting the code and formulated the following proof of 
concept.

Run the following:

drop dataverse test if exists;
create dataverse test;

use dataverse test;

create type FacebookMessageType as closed {
        message-id: int32,
        author-id: int32,
        in-response-to: int32?,
        sender-location: point?,
        message: string
}

create dataset FacebookMessages(FacebookMessageType)
primary key message-id, author-id;

load dataset FacebookMessages using localfs
(("path"="127.0.0.1://PUT_PATH_TO_ADM_FILE_HERE/fbm.adm"),("format"="adm"));

for $m in dataset('FacebookMessages')
where $m.message-id > 5
and $m.author-id > 0
return $m

The attached data file has only a single record that does not statisfy the 
query predicates; however, the system will return the record as part of the 
result set. If you change the key of the dataset to be a single key instead of 
a composite, the system will return the right answer.

I traced the issue and found that the bug occurs because when searching the 
primary index, we compare the search key which is <5,0> with the key of the 
record which is <5,6>. The result of the comparison is the search key is 
smaller than the record. Therefore, the index will blindly think that this 
record satisfy the search predicate and return it as a result.

I can think of two possible fixes:
1) modify the btree code to handle this case such that it will compare the keys 
individually instead of comparing them as one unit.
2) keep the original select operator in the plan to take care of false 
positives.

Original issue reported on code.google.com by salsuba...@gmail.com on 23 May 2014 at 2:39

Attachments:

fbm.adm

GoogleCodeExporter commented 9 years ago

Original comment by dtab...@gmail.com on 23 May 2014 at 9:03

Added labels: Priority-Critical
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

I think the root of the issue is that this predicate is incorrectly 
rewritten/sarged. This *entire* predicate cannot actually be supported by our 
btree as a range predicate over both fields of the composite key since the 
search cursor only supports contiguous intervals (ranges). In other words, the 
range (<5,0>, <+inf,+inf>) is a superset of all keys with a > 5 and b > 0 (e.g. 
<6, -1>). Without implementing a specialized search cursor, what we could do is:
1) Use the b+tree to find all items with a > 5
2) Use a select operator to filter b > 0 afterwards

Thoughts?

Original comment by zheilb...@gmail.com on 24 May 2014 at 8:35

GoogleCodeExporter commented 9 years ago

I think searching the b+tree with both predicates is OK as long as you keep the 
original select operator, which will obviously prune "most" of the false 
positives as opposed if you just use a single predicate to query the tree.

Original comment by salsuba...@gmail.com on 24 May 2014 at 10:07

GoogleCodeExporter commented 9 years ago

Actually, I don't understand how searching the btree with both predicates will 
work.
Consider the following sequence of keys that might exist in the btree:

<6, 1>, <7, -1>, <8, 10>

Using a search predicate of a > 5, b > 0, then the cursor will find and return 
<6, 1>, but will not return <7, -1>. Moreover, the cursor will stop, neglecting 
<8, 10> which satisfies the predicate.

Do you agree?

Original comment by zheilb...@gmail.com on 25 May 2014 at 3:14

GoogleCodeExporter commented 9 years ago

Actually that's not how the btree range cursor is implemented. We don't compare 
every record with the search key. Instead, whenever a new leaf page is fetched, 
we binary search it and decide what is the last record's index that should be 
part of the result set. Once this is decided, all records with smaller indexes 
are returned as result.

Original comment by salsuba...@gmail.com on 25 May 2014 at 4:25

GoogleCodeExporter commented 9 years ago

Right, so when you binary search and land on <6, -1>, what happens?

Original comment by zheilb...@gmail.com on 25 May 2014 at 4:28

GoogleCodeExporter commented 9 years ago

Ah, so you might get end up with false negatives too. It is even worse than 
what I thought :-)

So my second solution is out of the picture. Either do my first solution, or do 
some sort of prefix search that still guarantee you will get the correct result 
and filter out the false positives. You might want to double check your 
proposal always guarantee to return the correct result for all different 
scenarios (e.g., composite keys with more than 2 fields and different search 
predicates).

Original comment by salsuba...@gmail.com on 25 May 2014 at 4:42

namhnguyen / asterixdb

Wrong results for a primary index with composite keys #777