Open misch opened 8 years ago
... One example is a Runner that has the same values for all attributes except first_name, which exists in the following three versions:
I always said it and I'll say it again: Stupid people with their stupid accents >.<
This seems to work reasonably well right now, at least for accents. @misch do you feel like implementing another merge strategy that would merge entries such as "Paul Gaertner" and "Paul Gärtner"?
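A minimal sketch of what such a merge key could look like, assuming German-style transliteration (ä to ae, ö to oe, ü to ue, ß to ss) is the wanted behaviour. `name_key` and `UMLAUTS` are hypothetical names for illustration, not taken from the project's code:

```ruby
# Hypothetical helper: builds a comparison key so that different spellings
# of the same name collapse to one value. Not the project's actual code.
UMLAUTS = { "ä" => "ae", "ö" => "oe", "ü" => "ue", "ß" => "ss" }.freeze

def name_key(name)
  key = name.unicode_normalize(:nfc).downcase  # compose characters first, then lowercase
  key = key.gsub(/[äöüß]/, UMLAUTS)            # German transliteration, e.g. ä -> ae
  key = key.unicode_normalize(:nfd)            # split remaining accents off their base letter
  key = key.gsub(/\p{Mn}/, "")                 # drop the combining marks
  key.gsub(/[\s\-]+/, " ").strip               # collapse spacing and hyphen differences
end
```

With this key, "Paul Gaertner" and "Paul Gärtner" fall into the same group, and so do any number of further spellings that reduce to the same key. One limitation: a plain "Gartner" would still get a different key than "Gärtner", so an accent-stripping key would have to be compared as well if that case matters.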
Sure! :)
see #19
Another issue: possibly we should remove 'nationality' from the grouping/identifying attributes for runners, since it is often missing. One example is when searching for Matthias Burkhalter.
We also have to think about a solution for people who moved to a new hometown... If we leave both out, we are basically only matching the name and an estimated birth date with an accuracy of +/-10y :) That seems a bit risky to me... Maybe we could introduce some kind of match score for two users and, depending on it, merge them or not. This score could cover problems like a missing nationality or wrong accents / spacing / etc. (small penalty) and completely different values in some attributes (small penalty for :club_or_hometown, huge penalty for :first_name or :last_name). Do you think such an approach could make sense?
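The scoring idea above can be sketched roughly as follows. Everything here is an assumption for illustration: the penalty values, the threshold, and the names `PENALTIES`, `fold`, `mismatch_score` and `mergeable?` are made up, and the birth date estimate is left out:

```ruby
# Hypothetical penalties for completely different values; placeholders, not tuned on real data.
PENALTIES = {
  first_name: 10.0,       # huge penalty
  last_name: 10.0,        # huge penalty
  club_or_hometown: 1.0,  # small penalty, people move
  nationality: 1.0
}.freeze
MISSING_PENALTY  = 0.5    # one side has no value
SPELLING_PENALTY = 0.5    # values differ only by accents, case, or spacing
MERGE_THRESHOLD  = 3.0    # assumed cut-off: merge when the score stays below it

# Reduce a value to a form that ignores accents, case, and whitespace.
def fold(value)
  value.to_s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase.gsub(/\s+/, "")
end

# Sum of penalties over all compared attributes; 0.0 means identical records.
def mismatch_score(a, b)
  PENALTIES.sum do |attr, penalty|
    x = a[attr]
    y = b[attr]
    if x == y
      0.0
    elsif x.to_s.empty? || y.to_s.empty?
      MISSING_PENALTY
    elsif fold(x) == fold(y)
      SPELLING_PENALTY
    else
      penalty
    end
  end
end

def mergeable?(a, b)
  mismatch_score(a, b) < MERGE_THRESHOLD
end
```

Unlike the group query, this compares records pairwise, so it would probably only be practical on candidates that already share a coarse key such as the last name.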
Well, it would certainly make sense, maybe even with some machine learning. But then we also face the problem that there is quite a bit of data to process. One big advantage of the group query was that it's very fast to execute. If you feel like doing the ML, I'd be very interested in seeing the results!
Rethink the merging algorithm, since there are a few cases where there are more than two possible ways of writing a name.