misch / gpCollect


Improve Merging of duplicates #11

Open misch opened 8 years ago

misch commented 8 years ago

Rethink the merging algorithm, since there are a few cases where a name can be written in more than two ways.

misch commented 8 years ago

... One example is a Runner that has the same values for all attributes except first_name that exists in the following three versions:

panmari commented 8 years ago

I always said it and I'll say it again: Stupid people with their stupid accents >.<

panmari commented 8 years ago

This seems to work reasonably well right now, at least for accents. @misch do you feel like implementing another merge strategy that would merge entries such as

Paul Gaertner
Paul Gärtner

?
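One way to approach the `Gaertner` / `Gärtner` case is to compare names through a normalized key rather than their raw spelling. The following is only a minimal sketch, not what gpCollect actually does: `name_key` and `UMLAUT_DIGRAPHS` are hypothetical names, and expanding umlauts to digraphs is an assumption that fits German names but may wrongly conflate others.

```ruby
# Hypothetical mapping of German umlauts to their two-letter spellings.
UMLAUT_DIGRAPHS = { 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss' }.freeze

# Hypothetical helper: maps a name to a key that is equal for
# "Gärtner" and "Gaertner", and that ignores case and other accents.
def name_key(name)
  lowered = name.unicode_normalize(:nfc).downcase
  # Expand umlauts first, so "ä" becomes "ae" instead of a bare "a".
  expanded = lowered.gsub(/[äöüß]/, UMLAUT_DIGRAPHS)
  # Decompose the remaining accented letters and drop the combining marks.
  expanded.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts name_key('Paul Gärtner') == name_key('Paul Gaertner')
```

Grouping on such a key instead of the raw `first_name` / `last_name` would keep the merge a plain group query, at the cost of occasionally merging names that only look alike after normalization.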

misch commented 8 years ago

Sure! :)

misch commented 8 years ago

see #19

panmari commented 8 years ago

Another issue: we should possibly remove 'nationality' from the grouping/identifying attributes for runners, since it is often missing. One example shows up when searching for Matthias Burkhalter.

misch commented 8 years ago

We also have to think about a solution for people who moved to a new hometown... If we leave both out, we are basically only matching the name and an estimated birth date with an accuracy of +/-10y :) That seems a bit risky to me... Maybe we could introduce some kind of match score for two users and, based on it, decide whether to merge them. This score could account for problems like a missing nationality or wrong accents / spacing / etc. (small penalty) and completely different values in some attributes (small penalty for :club_or_hometown, huge penalty for :first_name or :last_name). Do you think such an approach could make sense?
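To make the match-score idea more concrete, here is a rough sketch of how such penalties could be added up. Everything in it is an assumption for illustration: the `Runner` struct, the weights, the threshold, and the helper names are hypothetical and not taken from the gpCollect code, and the estimated birth date is left out.

```ruby
# Hypothetical stand-in for the real model, limited to the attributes discussed.
Runner = Struct.new(:first_name, :last_name, :club_or_hometown, :nationality)

# Assumed weights: a mismatch in a name is "huge", the others are "small".
PENALTIES = {
  first_name: 10.0, last_name: 10.0, club_or_hometown: 1.0, nationality: 1.0
}.freeze
MISSING_PENALTY  = 0.5 # one side has no value
SPELLING_PENALTY = 0.5 # values differ only in accents, case, or spacing
MERGE_THRESHOLD  = 3.0 # assumed cut-off; would need tuning on real data

# Strips accents, case, whitespace, and hyphens for a loose comparison.
def loose(value)
  value.to_s.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase.gsub(/[\s\-]+/, '')
end

# Sums the penalty for every attribute; 0.0 means identical entries.
def match_penalty(a, b)
  PENALTIES.sum do |attr, weight|
    x = a[attr]
    y = b[attr]
    if x == y then 0.0
    elsif x.to_s.empty? || y.to_s.empty? then MISSING_PENALTY
    elsif loose(x) == loose(y) then SPELLING_PENALTY
    else weight
    end
  end
end

def merge?(a, b)
  match_penalty(a, b) < MERGE_THRESHOLD
end

old_entry = Runner.new('Matthias', 'Burkhalter', 'Bern', nil)
new_entry = Runner.new('Matthias', 'Burkhalter', 'Zürich', 'SUI')
puts merge?(old_entry, new_entry)
```

With these assumed weights, a changed hometown plus a missing nationality stays below the threshold, while a single differing name already exceeds it.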

panmari commented 8 years ago

Well, it would certainly make sense; we could maybe even use some machine learning for this. But then we also face the problem that there is quite a bit of data available. One big advantage of the group query was that it's very fast to execute. If you feel like doing the ML, I'd be very interested in seeing the results!
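One common way to keep a pairwise comparison affordable on a larger data set is to group entries by a cheap key first and only compare entries within the same group (often called blocking). This is not something proposed in the thread itself, just a possible sketch: `Entry`, `block_key`, and `candidate_pairs` are hypothetical names, and keying on the accent-stripped last name is an assumption.

```ruby
# Hypothetical stand-in for a runner entry.
Entry = Struct.new(:first_name, :last_name)

# Cheap grouping key (assumption): last name, lowercased, accents stripped.
def block_key(entry)
  entry.last_name.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Returns only the pairs that share a key, instead of all possible pairs.
def candidate_pairs(entries)
  entries.group_by { |e| block_key(e) }
         .values
         .flat_map { |group| group.combination(2).to_a }
end

entries = [
  Entry.new('André', 'Müller'),
  Entry.new('Andre', 'Muller'),
  Entry.new('Peter', 'Meier')
]
puts candidate_pairs(entries).length
```

Only the candidate pairs would then need the more expensive scoring or ML step; entries whose keys differ are never compared, which is the trade-off of this approach.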