lwhay / asterixdb

Automatically exported from code.google.com/p/asterixdb

Fuzzy Selection (with Existential Quantification) does not use Inverted Index #731

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Assuming the following schema:

create type TwitterUserType {
screen_name: string,
lang: string,
friends_count: int32,
statuses_count: int32,
name: string,
followers_count: int32
}

create type TweetMessageType {
tweetid: int64,
user: TwitterUserType,
sender_location: point?,
send_time: datetime,
referred_topics: {{ string }},
message_text: string
}

create dataset TweetMessages(TweetMessageType) primary key tweetid;

create index tNGramIdx on TweetMessages(message_text) type ngram(3);

The following query does not use the ngram index:

for $t in dataset TweetMessages
 where
(some $word in word-tokens($t.message_text)
 satisfies edit-distance-check($word, "blah", 3)[0] )
return {
  "id" : $t.tweetid,
  "message" : $t.message_text
}

But a simplified version of the query uses the inverted index:

for $t in dataset TweetMessages
let $ed := edit-distance-check($t.message_text, "blah", 2)
where $ed[0]
return {
  "id" : $t.tweetid,
  "message" : $t.message_text
}

Original issue reported on code.google.com by pouria.p...@gmail.com on 14 Mar 2014 at 8:46

GoogleCodeExporter commented 9 years ago
Index usage seems to depend on which function is used and which argument is passed to it.

This query uses the index:

for $t in dataset TweetMessages
let $ed := edit-distance($t.message_text, "Blah Blah")
where $ed <= 2
return {
  "id" : $t.tweetid,
  "message" : $t.message_text
}

But if you replace "Blah Blah" with "Blah", it no longer uses the index.
The same holds for the query that was pasted in the original bug report:
it uses the index with "blah blah" but not with "blah":

//Index is used here
for $t in dataset TweetMessages
let $ed := edit-distance-check($t.message_text, "blah blah", 2)
where $ed[0]
return {
  "id" : $t.tweetid,
  "message" : $t.message_text
}

Original comment by pouria.p...@gmail.com on 14 Mar 2014 at 10:45

GoogleCodeExporter commented 9 years ago
Hey Pouria,

I have implemented a new function to solve your problem. Can you change your 
query to the following:

for $t in dataset TweetMessages
let $ed := edit-distance-contains($t.message_text, "Blah Blah", 2)
where $ed[0]
return {
  "id" : $t.tweetid,
  "message" : $t.message_text
}

This query returns the records whose message_text contains a substring 
similar to "Blah Blah".
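
To illustrate what "contains a similar substring" means, here is a minimal Python sketch of approximate substring matching. It is not the AsterixDB implementation. The function names are hypothetical, the comparison is assumed to be case-sensitive, and the (matched, distance) return shape is an assumption made for the example.

```python
def min_substring_edit_distance(text: str, pattern: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # prev[j] is the best distance between the pattern prefix processed so far
    # and some substring of `text` ending at position j. The first row is all
    # zeros because a match may start anywhere in `text` at no cost.
    prev = [0] * (len(text) + 1)
    for i in range(1, len(pattern) + 1):
        curr = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitute
                prev[j] + 1,         # pattern character missing from text
                curr[j - 1] + 1,     # extra character in text
            )
        prev = curr
    # The match may also end anywhere in `text`.
    return min(prev)


def contains_within(text: str, pattern: str, threshold: int):
    """Return (matched, distance); this return shape is an assumption."""
    distance = min_substring_edit_distance(text, pattern)
    return distance <= threshold, distance


if __name__ == "__main__":
    print(contains_within("say blah blah now", "Blah Blah", 2))
```

Under these assumptions, a message containing "blah blah" matches "Blah Blah" with threshold 2, because only the two capital letters differ.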

Original comment by icetin...@gmail.com on 1 May 2014 at 11:13

GoogleCodeExporter commented 9 years ago
Great !
This is awesome
Thanks Inci ...

Original comment by pouria.p...@gmail.com on 1 May 2014 at 11:17

GoogleCodeExporter commented 9 years ago
As you figured out, the parameters given to the edit-distance-related functions 
affect index usage. Basically, we have a formula to decide how many ngrams 
need to match between the query and the record. We use the index if that number 
(T) is greater than 0; otherwise we don't use the index (this is called the 
"panic case"). We compute T as follows:

T = Number_of_grams_in_query - gram_length * threshold

If we use edit-distance() or edit-distance-check(), the number of grams in the 
query (Q) is computed as follows:

 Q = Length_of_query_string + gram_length - 1

If we use edit-distance-contains(), it is computed as:

 Q = Length_of_query_string - gram_length + 1

If these formulas give T > 0, the optimizer rewrites the query to use the 
inverted index.
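
As a worked check of these formulas against the queries in this thread, here is a small Python sketch. It is not AsterixDB code and the function names are hypothetical. The gram length of 3 is taken from the ngram(3) index tNGramIdx defined above.

```python
def grams_in_query(query: str, gram_length: int, contains: bool = False) -> int:
    """Q: number of grams in the query string, per the formulas above."""
    if contains:
        # edit-distance-contains()
        return len(query) - gram_length + 1
    # edit-distance() and edit-distance-check()
    return len(query) + gram_length - 1


def lower_bound_t(query: str, gram_length: int, threshold: int,
                  contains: bool = False) -> int:
    """T = Q - gram_length * threshold."""
    return grams_in_query(query, gram_length, contains) - gram_length * threshold


def uses_index(query: str, gram_length: int, threshold: int,
               contains: bool = False) -> bool:
    """The index is used only when T > 0; otherwise it is the panic case."""
    return lower_bound_t(query, gram_length, threshold, contains) > 0


if __name__ == "__main__":
    cases = [
        ("blah", 2, False),
        ("blah blah", 2, False),
        ("blah", 3, False),
        ("Blah Blah", 2, True),
    ]
    for query, threshold, contains in cases:
        t = lower_bound_t(query, 3, threshold, contains)
        print(repr(query), "threshold", threshold, "contains", contains,
              "T =", t, "index" if t > 0 else "panic")
```

With gram length 3 and threshold 2, "blah" gives Q = 6 and T = 0, which is the panic case, while "blah blah" gives Q = 11 and T = 5, so the index is used. This matches the behavior reported earlier in the thread.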

Original comment by icetin...@gmail.com on 1 May 2014 at 11:35

GoogleCodeExporter commented 9 years ago
I am closing this issue; however, the existential query that Pouria came up 
with as a workaround still doesn't use the index. We will decide what to do 
about existential queries in issue 654.

Original comment by icetin...@gmail.com on 1 May 2014 at 11:39

GoogleCodeExporter commented 9 years ago

Original comment by icetin...@gmail.com on 1 May 2014 at 11:39

GoogleCodeExporter commented 9 years ago
This issue was closed by revision be353dd4a54e.

Original comment by kiss...@gmail.com on 23 May 2014 at 8:09

GoogleCodeExporter commented 9 years ago
This issue was closed by revision be353dd4a54e.

Original comment by kiss...@gmail.com on 9 Jun 2014 at 6:41