namhnguyen / asterixdb

Automatically exported from code.google.com/p/asterixdb
0 stars 0 forks source link

(Left outer) Fuzzy join with Jacard does not use keyword index #741

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Assuming following index exists:

create index topicIIx on TweetMessages(referred_topics) type keyword;

Fuzzy join query below does not pick index (Note that it *could* be related to 
Issue 282):

for $t in dataset('TweetMessages')
where $t.send_time >= datetime('2011-06-18T14:10:17')
and 
$t.send_time < datetime('2011-06-18T15:10:17')
return {                     
    "tweet": $t.tweetid,               
    "similar-tweets": for $t2 in dataset('TweetMessages')
                      let $sim := similarity-jaccard-check($t.referred_topics, $t2.referred_topics, 0.6f)
              where $sim[0] and 
                      $t2.tweetid != $t.tweetid
                      return $t2.tweetid
}

If we switch from left outer join to a regular join, query picks the index, but 
the problem is selection on "send_time" attribute does not get pushed down, 
which makes the query expensive on large datasets:

for $t1 in dataset('TweetMessages')
for $t2 in dataset('TweetMessages')
let $sim := similarity-jaccard-check($t1.referred_topics, $t2.referred_topics, 
0.6f)
where $sim[0] and $t2.tweetid != $t1.tweetid
and $t1.send_time >= datetime('2006-06-18T14:10:17') and 
$t1.send_time < datetime('2012-06-18T15:10:17')
return {                     
    "t1": $t1.tweetid,               
    "t2": $t2.tweetid
}

Here is the plan for the revision:

distribute result [%0->$$19]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange 
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$19])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$19] <- [function-call: asterix:closed-record-constructor, Args:[AString: {t1}, %0->$$25, AString: {t2}, %0->$$26]]
      -- ASSIGN  |PARTITIONED|
        project ([$$25, $$26])
        -- STREAM_PROJECT  |PARTITIONED|
          select (function-call: algebricks:and, Args:[function-call: algebricks:neq, Args:[%0->$$26, %0->$$25], function-call: asterix:get-item, Args:[function-call: asterix:similarity-jaccard-check, Args:[function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {6}], %0->$$28, AFloat: {0.6}], AInt32: {0}]])
          -- STREAM_SELECT  |PARTITIONED|
            project ([$$0, $$25, $$26, $$28])
            -- STREAM_PROJECT  |PARTITIONED|
              select (function-call: algebricks:and, Args:[function-call: algebricks:lt, Args:[%0->$$22, ADateTime: { 2012-06-18T15:10:17.000Z }], function-call: algebricks:ge, Args:[%0->$$22, ADateTime: { 2006-06-18T14:10:17.000Z }]])
              -- STREAM_SELECT  |PARTITIONED|
                assign [$$22] <- [function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {4}]]
                -- ASSIGN  |PARTITIONED|
                  project ([$$0, $$25, $$26, $$28])
                  -- STREAM_PROJECT  |PARTITIONED|
                    exchange 
                    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                      unnest-map [$$25, $$0] <- function-call: asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, AString: {NewSocialNetworkData}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false}, AInt32: {1}, %0->$$33, AInt32: {1}, %0->$$33, TRUE, TRUE, FALSE]
                      -- BTREE_SEARCH  |PARTITIONED|
                        exchange 
                        -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                          order (ASC, %0->$$33) 
                          -- STABLE_SORT [$$33(ASC)]  |PARTITIONED|
                            exchange 
                            -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                              unnest-map [$$33] <- function-call: asterix:index-search, Args:[AString: {topicIIx}, AInt32: {4}, AString: {NewSocialNetworkData}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {true}, AInt32: {1}, AFloat: {0.6}, AInt32: {22}, AInt32: {1}, %0->$$28]
                              -- LENGTH_PARTITIONED_INVERTED_INDEX_SEARCH  |PARTITIONED|
                                exchange 
                                -- BROADCAST_EXCHANGE  |PARTITIONED|
                                  project ([$$26, $$28])
                                  -- STREAM_PROJECT  |PARTITIONED|
                                    assign [$$28] <- [function-call: asterix:field-access-by-index, Args:[%0->$$32, AInt32: {6}]]
                                    -- ASSIGN  |PARTITIONED|
                                      exchange 
                                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                        data-scan []<-[$$26, $$32] <- NewSocialNetworkData:TweetMessages
                                        -- DATASOURCE_SCAN  |PARTITIONED|
                                          exchange 
                                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                            empty-tuple-source
                                            -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

Original issue reported on code.google.com by pouria.p...@gmail.com on 20 Mar 2014 at 6:18

GoogleCodeExporter commented 9 years ago

Original comment by icetin...@gmail.com on 21 Mar 2014 at 6:09

GoogleCodeExporter commented 9 years ago
In the regular join the "send_time" selection cannot be pushed further. The 
reason is there are two btree search in the query plan. The first one is 
secondary index search, and the second one is to get the rest of the record by 
doing a primary index look up. So the selection can be done after the entire 
record is retrieved. If you do the selection on the other side of the join 
($t2.send_time), then this selection will be done before the index look up.

Original comment by icetin...@gmail.com on 21 Mar 2014 at 6:14

GoogleCodeExporter commented 9 years ago

Original comment by kiss...@gmail.com on 17 Jun 2014 at 12:39