lwhay / asterixdb

Automatically exported from code.google.com/p/asterixdb

Existential query does not use index #654

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
for $t in dataset Tweets
where some $term in ["#xboxkinect", "xbox", "#xbox", "xbox360", "#xbox360",
                     "xboxone", "#xboxone"]
      satisfies contains($t.text, $term)
group by $b := interval-bin($t.created_at, datetime("2013-01-06T00:00:00"),
                            day-time-duration("P7D")) with $t
return {"b": $b, "count": count($t)}

Currently we do not turn the existential query into a disjunction (which would 
have used the inverted index).

Original issue reported on code.google.com by vinay...@gmail.com on 23 Oct 2013 at 7:27

GoogleCodeExporter commented 9 years ago

Original comment by vinay...@gmail.com on 23 Oct 2013 at 7:28

GoogleCodeExporter commented 9 years ago
Check if the disjunction works and whether or not it uses the index.

Original comment by zheilb...@gmail.com on 29 Oct 2013 at 11:48

GoogleCodeExporter commented 9 years ago

Original comment by zheilb...@gmail.com on 8 Nov 2013 at 6:26

GoogleCodeExporter commented 9 years ago
Tested with a disjunction and it does not work: the index is not used. The
index is only picked if there is a single predicate.

For clarity, this query will not pick the inverted index:
use dataverse azure;

for $t in dataset Tweets
where contains($t.text, "#xboxkinect") or contains($t.text, "xbox")
   or contains($t.text, "#xbox") or contains($t.text, "xbox360")
   or contains($t.text, "#xbox360") or contains($t.text, "xboxone")
   or contains($t.text, "#xboxone")
group by $b := interval-bin($t.created_at, datetime("2013-01-06T00:00:00"),
                            day-time-duration("P7D")) with $t
return {"b": $b, "count": count($t)}

This query WILL pick the inverted index:
use dataverse azure;

for $t in dataset Tweets
where contains($t.text, "#xboxkinect")
group by $b := interval-bin($t.created_at, datetime("2013-01-06T00:00:00"),
                            day-time-duration("P7D")) with $t
return {"b": $b, "count": count($t)}

Original comment by zheilb...@gmail.com on 8 Nov 2013 at 6:36

GoogleCodeExporter commented 9 years ago
Does this query pick up a regular B-Tree index, if one is defined over the
$t.text field?

for $t in dataset Tweets
where contains($t.text, "#xboxkinect")
return $t

From what I remember, the use of contains(...) or any other function in the 
WHERE clause did not lead to an index lookup. You may want to verify whether
that is still the case or whether it was fixed recently.

Original comment by khfaraaz82 on 8 Nov 2013 at 7:05

GoogleCodeExporter commented 9 years ago

Original comment by zheilb...@gmail.com on 15 Nov 2013 at 8:31

GoogleCodeExporter commented 9 years ago

Original comment by vinay...@gmail.com on 15 Nov 2013 at 8:33

GoogleCodeExporter commented 9 years ago
We have more basic problems with in-lists and disjunctive queries as well. 

Original comment by dtab...@gmail.com on 18 Feb 2014 at 4:56

GoogleCodeExporter commented 9 years ago
Just a note:

I'm working on a rewriting for eq-predicates that translates disjunctions to 
joins. 
So the equivalent query for this case would be:

for $t in dataset Tweets
for $w in ["#xboxkinect", "xbox", "#xbox", "xbox360", "#xbox360", "xboxone",
           "#xboxone"]
where contains($t.text, $w)
group by $b := interval-bin($t.created_at, datetime("2013-01-06T00:00:00"),
                            day-time-duration("P7D")) with $t
return {"b": $b, "count": count($t)}

However,
a) this wouldn't work for this case, as we would introduce duplicates (every
text that contains "xbox360" also contains "xbox"), and
b) we need to be sure that the join would pick the index correctly (as it does 
for eq).
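
To make point a) concrete, here is a small Python sketch that models the two query shapes over in-memory data. It is only a toy model of the semantics, not AsterixDB code: the sample tweets and the function names are made up for illustration, and contains() is approximated by Python's substring test.

```python
# Toy model of the two query shapes; this is not AsterixDB code.
TERMS = ["#xboxkinect", "xbox", "#xbox", "xbox360", "#xbox360",
         "xboxone", "#xboxone"]

# Hypothetical sample data: (id, text) pairs standing in for dataset Tweets.
TWEETS = [
    (1, "just bought an xbox360"),
    (2, "#xboxone launch tonight"),
    (3, "nothing relevant here"),
]


def existential_count(tweets, terms):
    # Models "some $term in [...] satisfies contains($t.text, $term)":
    # each tweet is counted at most once.
    return sum(1 for _, text in tweets if any(term in text for term in terms))


def join_count(tweets, terms):
    # Models "for $t ... for $w ... where contains($t.text, $w)":
    # one output row per matching (tweet, term) pair.
    return sum(1 for _, text in tweets for term in terms if term in text)


if __name__ == "__main__":
    print("existential:", existential_count(TWEETS, TERMS))
    print("join:", join_count(TWEETS, TERMS))
```

On this sample the join shape produces more rows than the existential shape, because a tweet that matches several overlapping terms is emitted once per term. With equality predicates a value can match at most one distinct list entry, which is presumably why the problem does not show up for eq.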

Original comment by westm...@gmail.com on 1 May 2014 at 6:26

GoogleCodeExporter commented 9 years ago
Discussion results:
- existential quantification and disjunction can both be rewritten into a join
- for predicates that introduce duplicates, like contains, we need duplicate
elimination on the outer (probe) side of the join
- the rewriting should only be done if an index is available (for an index
nested-loop (INL) join) or if we have a bulk-join operation for contains (like
fuzzy joins)
- if an index is available, we also have a key that can be used for duplicate
elimination

For a simple case, the rewriting for existential quantification would look like
this:

for $x in dataset Tweets
where some $t in ["...", "..."] satisfies contains($x.t, $t)
...

  ->

for $x in dataset Tweets
for $t in ["...", "..."]
where contains($x.t, $t)
distinct by $x.key
...
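
As a sanity check on the rewriting, the following Python sketch models both forms over in-memory records and removes duplicates by key on the join side. It is a toy model under stated assumptions, not the optimizer rule itself: the record layout (a "key" and a "text" field), the sample data, and the function names are hypothetical, and contains() is approximated by a substring test.

```python
def existential(tweets, terms):
    # Models: where some $t in [...] satisfies contains($x.t, $t)
    return [x for x in tweets if any(term in x["text"] for term in terms)]


def join_distinct_by_key(tweets, terms):
    # Models: for $x ... for $t in [...] where contains($x.t, $t)
    #         distinct by $x.key
    seen = set()      # keys already emitted
    result = []
    for x in tweets:
        for term in terms:
            if term in x["text"] and x["key"] not in seen:
                seen.add(x["key"])
                result.append(x)
    return result


if __name__ == "__main__":
    # Hypothetical sample records.
    tweets = [
        {"key": 1, "text": "xbox360 is out"},
        {"key": 2, "text": "no match"},
        {"key": 3, "text": "#xboxone and xbox360"},
    ]
    terms = ["xbox", "xbox360", "#xboxone"]
    print(existential(tweets, terms) == join_distinct_by_key(tweets, terms))
```

The sketch relies on the key being unique per record, which matches the last bullet above: when an index is available, its key can drive the duplicate elimination.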

Original comment by westm...@gmail.com on 16 May 2014 at 4:36