Closed GoogleCodeExporter closed 8 years ago
I think defining a callback interface which receives (record, probability)
values would be a good first step, and, yes, passing it as a parameter to
link(). The current behaviour could be default.
Haven't time right now to see how it could be further integrated.
Original comment by lar...@gmail.com
on 7 Nov 2011 at 8:47
But isn't this callback interface just the MatchListener interface? The even
more simplistic approach could be to add a boolean parameter which determines,
whether 'match(...)' or 'matchRL(...)' should be called.
What do you think about adding a 'finished()' method to the MatchListener
interface? This could be called after all records were processed. Then we could
implement a 'BestFitMetaMatchListener' which just stores best matches per
record and finally (when 'finished' is called) calls the 'matches' methods of a
given MatchListener.
Original comment by FMitzl...@googlemail.com
on 8 Nov 2011 at 4:39
> But isn't this callback interface just the MatchListener interface?
Actually, now that I think about it you're probably right.
> The even more simplistic approach could be to add a boolean parameter which
determines,
> whether 'match(...)' or 'matchRL(...)' should be called.
Hmmm. How would this help? You lost me here.
> What do you think about adding a 'finished()' method to the MatchListener
interface?
That's already there, except it's called 'endProcessing()'. There's also an
'endRecord()' which tells you when we've finished looking for matches for a
specific record. This means you can do top-k matching and only keep the best k
matches for each record (and discard the rest before processing is over).
>Then we could implement a 'BestFitMetaMatchListener' which just stores best matches per record and finally
> (when 'finished' is called) calls the 'matches' methods of a given
MatchListener.
Looking at it more closely I think you have what you need already. Just set the
threshold low. That may cause Duke to start querying all your properties, in
which case you may want to raise the threshold a bit and set a lower
maybe-threshold. Duke will then pass all (or at least most) matches to your
listener, and together with the 'endRecord()' call you then have what you need
for top-k matching.
Should work, no?
Original comment by lar...@gmail.com
on 8 Nov 2011 at 6:56
>> The even more simplistic approach could be to add a boolean parameter which
determines,
>> whether 'match(...)' or 'matchRL(...)' should be called.
> Hmmm. How would this help? You lost me here.
If I understand the Processor's 'linkRecords' method correctly, records are
matched using the 'matchRL' method (Processor.java:185) which only calls the
registered MatchListeners for the best matching record (Processor.java:275). If
this is true, no MatchListener has access to other matching entities.
>> What do you think about adding a 'finished()' method to the MatchListener
interface?
> That's already there, except it's called 'endProcessing()'. There's also an
>'endRecord()' which tells you when we've finished looking for matches for a
> specific record. This means you can do top-k matching and only keep the best k
> matches for each record (and discard the rest before processing is over).
Thats great! I just missed that...
>>Then we could implement a 'BestFitMetaMatchListener' which just stores best
>> matches per record and finally
>> (when 'finished' is called) calls the 'matches' methods of a given
MatchListener.
> Looking at it more closely I think you have what you need already. Just set
the
> threshold low. That may cause Duke to start querying all your properties, in
which
> case you may want to raise the threshold a bit and set a lower
maybe-threshold.
> Duke will then pass all (or at least most) matches to your listener, and
together
> with the 'endRecord()' call you then have what you need for top-k matching.
> Should work, no?
I'm with you, modulo the 'matchRL'-issue mentioned above.
(Code references are relative to
http://code.google.com/p/duke/source/browse/src/main/java/no/priv/garshol/duke/P
rocessor.java?r=533fe8427a63ea575227fc407be1b609120fbed0)
Original comment by FMitzl...@googlemail.com
on 8 Nov 2011 at 8:58
> If I understand the Processor's 'linkRecords' method correctly, records are
matched using the 'matchRL'
> method (Processor.java:185) which only calls the registered MatchListeners
for the best matching record
> (Processor.java:275). If this is true, no MatchListener has access to other
matching entities.
Ah, true. Sorry. I was in a hurry. (Working on something else today.)
There is actually a bigger issue hiding here. The algorithm I have implemented
for choosing record pairs in record linkage mode is fairly primitive, and many
others exist. I read a nice overview in
http://research.microsoft.com/pubs/153478/msr-report-1to1.pdf and have been
thinking about how to make algorithms pluggable ever since.
(You don't need to read the next bit. It's really me talking to myself to think
this through.)
The trouble is that plugging in other algorithms isn't trivial, since they may
need to store considerable temporary state, and so may want to approch matching
in a different way. Looking through the paper again, Unconstrained ManyMany and
One-to-Many FirstChoice are easy (indeed, the second is what Duke does now).
The various approaches to One-to-One Entity Resolution are considerably harder.
It seems that they all need to either store the entire set of (r1, r2, conf)
tuples and then analyze it, except some that need to control the order in which
records are matched. Still, I guess this is no reason why we can't make this
pluggable.
Hang on a bit, and I'll work in what you need for this.
Original comment by lar...@gmail.com
on 8 Nov 2011 at 12:21
Okay, I reworked things so that there are internal match filters which handle
the two strategies we currently have. I got rid of the justOne parameter and
the matchRL method, so that now the code is a lot cleaner.
The linkRecords method now defaults to passing everything through, so you get
the behaviour you want. Still not sure how to make matching strategies
choosable from outside, but I'll get to that later.
Original comment by lar...@gmail.com
on 9 Nov 2011 at 5:17
Original issue reported on code.google.com by
FMitzl...@googlemail.com
on 7 Nov 2011 at 8:22