namhnguyen / asterixdb

Automatically exported from code.google.com/p/asterixdb

Fuzzy Join not working #154

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hyracks stabilization revision: 1723
Asterix stabilization revision: 522

The fuzzy join rule does not fire, and the following query fails because an 
invalid argument is passed to Jaccard:

---
for $fbu in dataset('FacebookUsers')
for $t in dataset('TweetMessages')
let $tu := $t.user
where $tu.name ~= $fbu.name
return {
  "id": $fbu.id,
  "name": $fbu.name,
  "similar-users": {
    "twitter-screenname": $tu.screen-name,
    "twitter-name": $tu.name
  }
}

---

My initial guess is that the problem is caused by the two-level access (getting 
screen-name from a user inside the TweetMessages dataset). As an example, the 
following query, which also uses ~=, works fine on the same data:

----
for $t in dataset('TweetMessages')
for $t2 in dataset('TweetMessages')
where $t2.referred-topics ~= $t.referred-topics
      and $t2.tweetid != $t.tweetid
return {
"tweet": $t,
"similar-tweets": $t2
}
----

Files with the DDL statements and sample data for the two datasets used are 
attached to reproduce the bug.

Original issue reported on code.google.com by pouria.p...@gmail.com on 21 Jul 2012 at 1:40

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by pouria.p...@gmail.com on 21 Jul 2012 at 1:41

Attachments:

GoogleCodeExporter commented 9 years ago
Pouria hit the issue during benchmarking.
Issue needs to be assigned to an owner.

Original comment by pouria.p...@gmail.com on 19 Nov 2012 at 7:33

GoogleCodeExporter commented 9 years ago

Original comment by khfaraaz82 on 19 Nov 2012 at 8:38

GoogleCodeExporter commented 9 years ago
There are a bunch of issues here:

1. Our system uses "~=" in an inconsistent way. Rares' fuzzy join rules accept 
operands of string type for Jaccard (injecting word tokenizers in a rewrite). 
My rules accept only list types as arguments. In my opinion, Jaccard should 
only accept operands of type list, and leave tokenization to the user.

2. In the first query, if Rares' fuzzy join rule had fired and rewritten the 
plan, the query would have succeeded. However, because the rule did not fire, 
the query is run as a nested-loop (NL) join and eventually reports that Jaccard 
doesn't accept string operands (correct behavior in my opinion). The reason the 
fuzzy join rule does not fire may indeed be the two-level access (most likely 
the primary-key inference fails, and the fuzzy join rule bails out).

The second query works because referred-topics is of type list.
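
To illustrate the list-vs-string distinction outside the system, here is a 
minimal Python sketch. It is not AsterixDB code: the names jaccard and 
word_tokens are hypothetical, the tokenizer (lowercase, split on whitespace) is 
an assumption, and so is the convention that two empty lists have similarity 1.0. 
It only shows the proposed behavior, where Jaccard rejects strings and the caller 
tokenizes first.

```python
def word_tokens(text):
    """Hypothetical tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity of two token lists, each treated as a set."""
    if isinstance(a, str) or isinstance(b, str):
        # Proposed behavior: string operands are a user error.
        raise TypeError("jaccard expects lists of tokens, not strings")
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen for this sketch only
    return len(set_a & set_b) / len(set_a | set_b)


# The caller tokenizes the strings before comparing them.
similarity = jaccard(word_tokens("Ann Lee"), word_tokens("ann lee smith"))
```

Under these assumptions, the two names share 2 of 3 distinct tokens, so 
similarity is 2/3, while calling jaccard("Ann Lee", "ann lee smith") directly 
raises TypeError, analogous to the error the first query reports.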

In my opinion, resolving this issue means achieving the following behavior: 
the first query is treated as a user error in any case, because Jaccard doesn't 
accept strings.

The solution may entail fixing Rares' fuzzy join to only accept lists as 
operands to "~=". This will be rather difficult.

Original comment by alexande...@gmail.com on 19 Nov 2012 at 9:51

GoogleCodeExporter commented 9 years ago

Original comment by vinay...@gmail.com on 7 Dec 2012 at 7:48

GoogleCodeExporter commented 9 years ago

Original comment by vinay...@gmail.com on 17 May 2013 at 8:18