namhnguyen / asterixdb

Automatically exported from code.google.com/p/asterixdb
0 stars 0 forks source link

Order by in nested plan of left-outer-join doesn't work correctly. #771

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Order by in nested plan of left-outer-join(which doesn't use any index for the 
join) doesn't work.
All existing left-outer-join tests in master branch have only zero or one 
matched record, so even if the test queries include order by in the nested 
plan, it is not helpful to check whether the order by works or not. 
Thus, I think that order by in nested plan hasn't been tested at all so far.

#. DDL statements

drop dataverse test if exists;
create dataverse test;
use dataverse test;

create type TwitterUserType as closed {
    screen-name: string,
    lang: string,
    friends-count: int32,
    statuses-count: int32,
    name: string,
    followers-count: int32
} 

create type TweetMessageType as closed {
    tweetid: int64,
        user: TwitterUserType,
        sender-location: point,
    send-time: datetime,
        referred-topics: {{ string }},
    message-text: string,
    countA: int32,
    countB: int32
}

create dataset TweetMessages(TweetMessageType)
primary key tweetid;

create index twmSndLocIx on TweetMessages(sender-location) type rtree;
create index msgCountAIx on TweetMessages(countA) type btree;
create index msgCountBIx on TweetMessages(countB) type btree;
create index msgTextIx on TweetMessages(message-text) type keyword;

#. Update

use dataverse test;

load dataset TweetMessages 
using "edu.uci.ics.asterix.external.dataset.adapter.NCFileSystemAdapter"
(("path"="nc1://data/twitter/tw_for_indexleftouterjoin.adm"),("format"="adm"));

*tw_for_indexleftouterjoin.adm file is attached.

#. Query
use dataverse test;

for $t1 in dataset('TweetMessages')
where $t1.tweetid < int64("20")
order by $t1.tweetid
return {
"tweetid1": $t1.tweetid,
"count1":$t1.countA,
"t2info": for $t2 in dataset('TweetMessages') 
          where $t1.countA = $t2.countB 
          order by $t2.tweetid
          return {"tweetid2": $t2.tweetid,
                  "count2":$t2.countB}
};

#. Wrong Results (--> the order of tweetid2 is not maintained in the result)
{ "tweetid1": 1i64, "count1": 1, "t2info": [  ] }
{ "tweetid1": 2i64, "count1": 2, "t2info": [ { "tweetid2": 60i64, "count2": 2 } 
] }
{ "tweetid1": 3i64, "count1": 3, "t2info": [ { "tweetid2": 206i64, "count2": 3 
}, { "tweetid2": 105i64, "count2": 3 } ] }
{ "tweetid1": 4i64, "count1": 4, "t2info": [  ] }
{ "tweetid1": 5i64, "count1": 5, "t2info": [ { "tweetid2": 138i64, "count2": 5 
}, { "tweetid2": 175i64, "count2": 5 } ] }
{ "tweetid1": 6i64, "count1": 6, "t2info": [ { "tweetid2": 148i64, "count2": 6 
} ] }
{ "tweetid1": 7i64, "count1": 7, "t2info": [ { "tweetid2": 125i64, "count2": 7 
} ] }
{ "tweetid1": 8i64, "count1": 8, "t2info": [  ] }
{ "tweetid1": 9i64, "count1": 9, "t2info": [ { "tweetid2": 141i64, "count2": 9 
} ] }

Original issue reported on code.google.com by kiss...@gmail.com on 9 May 2014 at 8:44

Attachments:

GoogleCodeExporter commented 9 years ago
Here are the results that I am getting when I run this query in master (which 
seems correct to me):

{ "tweetid1": 1i64, "count1": 1, "t2info": [  ] }
{ "tweetid1": 2i64, "count1": 2, "t2info": [ { "tweetid2": 60i64, "count2": 2 } 
] }
{ "tweetid1": 3i64, "count1": 3, "t2info": [ { "tweetid2": 105i64, "count2": 3 
}, { "tweetid2": 206i64, "count2": 3 } ] }
{ "tweetid1": 4i64, "count1": 4, "t2info": [  ] }
{ "tweetid1": 5i64, "count1": 5, "t2info": [ { "tweetid2": 138i64, "count2": 5 
}, { "tweetid2": 175i64, "count2": 5 } ] }
{ "tweetid1": 6i64, "count1": 6, "t2info": [ { "tweetid2": 148i64, "count2": 6 
} ] }
{ "tweetid1": 7i64, "count1": 7, "t2info": [ { "tweetid2": 125i64, "count2": 7 
} ] }
{ "tweetid1": 8i64, "count1": 8, "t2info": [  ] }
{ "tweetid1": 9i64, "count1": 9, "t2info": [ { "tweetid2": 141i64, "count2": 9 
} ] }
{ "tweetid1": 10i64, "count1": 10, "t2info": [  ] }
{ "tweetid1": 11i64, "count1": 11, "t2info": [ { "tweetid2": 26i64, "count2": 
11 } ] }
{ "tweetid1": 12i64, "count1": 12, "t2info": [ { "tweetid2": 44i64, "count2": 
12 }, { "tweetid2": 61i64, "count2": 12 } ] }
{ "tweetid1": 13i64, "count1": 13, "t2info": [ { "tweetid2": 13i64, "count2": 
13 }, { "tweetid2": 181i64, "count2": 13 } ] }
{ "tweetid1": 14i64, "count1": 14, "t2info": [  ] }
{ "tweetid1": 15i64, "count1": 15, "t2info": [ { "tweetid2": 177i64, "count2": 
15 } ] }
{ "tweetid1": 16i64, "count1": 16, "t2info": [ { "tweetid2": 48i64, "count2": 
16 } ] }
{ "tweetid1": 17i64, "count1": 17, "t2info": [ { "tweetid2": 133i64, "count2": 
17 } ] }
{ "tweetid1": 18i64, "count1": 18, "t2info": [ { "tweetid2": 20i64, "count2": 
18 } ] }
{ "tweetid1": 19i64, "count1": 19, "t2info": [ { "tweetid2": 15i64, "count2": 
19 } ] }

And this is the optimized query plan:

distribute result [%0->$$7]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange 
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$7])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$7] <- [function-call: asterix:closed-record-constructor, Args:[AString: {tweetid1}, %0->$$20, AString: {count1}, %0->$$29, AString: {t2info}, %0->$$18]]
      -- ASSIGN  |PARTITIONED|
        project ([$$18, $$20, $$29])
        -- STREAM_PROJECT  |PARTITIONED|
          exchange 
          -- SORT_MERGE_EXCHANGE [$$28(ASC) ]  |PARTITIONED|
            order (ASC, %0->$$28) 
            -- STABLE_SORT [$$28(ASC)]  |PARTITIONED|
              exchange 
              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                group by ([$$20 := %0->$$24]) decor ([$$28 := %0->$$24; $$29 := %0->$$25]) {
                          aggregate [$$18] <- [function-call: asterix:listify, Args:[function-call: asterix:closed-record-constructor, Args:[AString: {tweetid2}, %0->$$21, AString: {count2}, %0->$$22]]]
                          -- AGGREGATE  |LOCAL|
                            select (function-call: algebricks:not, Args:[function-call: algebricks:is-null, Args:[%0->$$23]])
                            -- STREAM_SELECT  |LOCAL|
                              nested tuple source
                              -- NESTED_TUPLE_SOURCE  |LOCAL|
                       }
                -- PRE_CLUSTERED_GROUP_BY[$$24]  |PARTITIONED|
                  exchange 
                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                    order (ASC, %0->$$24) (ASC, %0->$$21) 
                    -- STABLE_SORT [$$24(ASC), $$21(ASC)]  |PARTITIONED|
                      exchange 
                      -- HASH_PARTITION_EXCHANGE [$$24]  |PARTITIONED|
                        left outer join (function-call: algebricks:eq, Args:[%0->$$25, %0->$$22])
                        -- HYBRID_HASH_JOIN [$$25][$$22]  |PARTITIONED|
                          exchange 
                          -- HASH_PARTITION_EXCHANGE [$$25]  |PARTITIONED|
                            project ([$$25, $$24])
                            -- STREAM_PROJECT  |PARTITIONED|
                              assign [$$25] <- [function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {6}]]
                              -- ASSIGN  |PARTITIONED|
                                project ([$$0, $$24])
                                -- STREAM_PROJECT  |PARTITIONED|
                                  exchange 
                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                    unnest-map [$$24, $$0] <- function-call: asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, AString: {test}, AString: {TweetMessages}, ABoolean: {false}, ABoolean: {false}, AInt32: {0}, AInt32: {1}, %0->$$32, TRUE, FALSE, FALSE]
                                    -- BTREE_SEARCH  |PARTITIONED|
                                      exchange 
                                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                        assign [$$32] <- [AInt64: {20}]
                                        -- ASSIGN  |PARTITIONED|
                                          empty-tuple-source
                                          -- EMPTY_TUPLE_SOURCE  |PARTITIONED|
                          exchange 
                          -- HASH_PARTITION_EXCHANGE [$$22]  |PARTITIONED|
                            project ([$$21, $$23, $$22])
                            -- STREAM_PROJECT  |PARTITIONED|
                              assign [$$23, $$22] <- [TRUE, function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: {7}]]
                              -- ASSIGN  |PARTITIONED|
                                exchange 
                                -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                  data-scan []<-[$$21, $$1] <- test:TweetMessages
                                  -- DATASOURCE_SCAN  |PARTITIONED|
                                    exchange 
                                    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                      empty-tuple-source
                                      -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

Young-Seok, have you tried this query on master? And can you compare this query 
plan with yours to see if we are doing something different.

Original comment by icetin...@gmail.com on 11 May 2014 at 4:12

GoogleCodeExporter commented 9 years ago
I will check it in master again and get back to you.

Original comment by kiss...@gmail.com on 11 May 2014 at 7:31

GoogleCodeExporter commented 9 years ago
Order by in nested plan DOES work in the master branch.
This is invalid issue.
Sorry about this.

Original comment by kiss...@gmail.com on 12 May 2014 at 11:05

GoogleCodeExporter commented 9 years ago
Reopening the issue due to newly discovered information
Error is triggered by incorrect comparison between null and non-null (integer 
in the example) values. Comparison does not check the length of the null value 
and takes the following arbitrary 4 bytes from the frame, which might trigger a 
wring order.

Original comment by ildar.absalyamov on 3 Feb 2015 at 10:30

GoogleCodeExporter commented 9 years ago

Original comment by ildar.absalyamov on 5 Feb 2015 at 4:45