Closed ImreSamu closed 6 years ago
Cool, thanks @ImreSamu! The filterNames script actually produces a list of duplicates for me already, but I might work the special "bank" processing into it.
The filterNames script actually produces a list of duplicates for me already
Ouch, I have not checked - sorry.
but I might work the special "bank" processing into it.
OK.
I am also removing all spaces, so here is another test case for you:
tourism/hotel | "Parkhotel" vs. "Park Hotel"
(As far as I can see, it is not in your list.)
Ouch, I have not checked - sorry.
No apology needed! I'm happy people are interested in this project 😄 I just need to update the README and stuff, and well, it's only Monday.
My goal is to get this repo into a state where we can ask for lots of volunteers to do the deduplication and look up the missing brand:wikidata and brand:wikipedia tags. Hacktoberfest starts in 2 weeks, and that would be a great task for first-time open source contributors.
Duplicate catcher currently looks like this - thanks @ImreSamu for the suggestions!
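// Assumes the npm `diacritics` package, which provides remove().
var diacritics = require('diacritics');

// Removes noise from the name so that we can compare
// similar names for catching duplicates.
function stemmer(name) {
    // The `g` flag strips every occurrence, not just the first
    // (e.g. all spaces in a multi-word name).
    var noise = [
        /ban(k|c)(a|o)?/gi,
        /банк/gi,
        /coop/gi,
        /express/gi,
        /(gas|fuel)/gi,
        /\s/g
    ];
    name = noise.reduce((acc, regex) => acc.replace(regex, ''), name);
    return diacritics.remove(name.toLowerCase());
}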
Thanks!
Here is a bigger list, and my Parkhotel example is now in it: :+1:
"tourism/hotel|Parkhotel" -> duplicates? -> "tourism/hotel|Park Hotel"
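For readers following along, here is a minimal sketch of how a stemmed name could be used to produce pairs like the one above. It is not the repo's filterNames script. The names `stem` and `findDuplicates` are hypothetical, the `tag|name` key format is taken from the example line, and `String.prototype.normalize` stands in for the `diacritics` package so that the snippet has no dependencies.

```javascript
// Hypothetical stand-in for the stemmer discussed in this thread.
// Uses normalize('NFD') instead of the `diacritics` package (an assumption
// made only to keep this sketch dependency-free).
function stem(name) {
  const noise = [/ban(k|c)(a|o)?/gi, /банк/gi, /coop/gi, /express/gi, /(gas|fuel)/gi, /\s/g];
  const stripped = noise.reduce((acc, regex) => acc.replace(regex, ''), name);
  // Decompose accented characters, then drop the combining marks.
  return stripped.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Groups "tag|name" keys whose stemmed names collide within the same tag.
// Returns only the groups that contain more than one key.
function findDuplicates(keys) {
  const groups = new Map();
  for (const key of keys) {
    const sep = key.indexOf('|');
    const tag = key.slice(0, sep);
    const name = key.slice(sep + 1);
    const bucket = tag + '|' + stem(name);
    if (!groups.has(bucket)) groups.set(bucket, []);
    groups.get(bucket).push(key);
  }
  return [...groups.values()].filter(group => group.length > 1);
}

const sample = [
  'tourism/hotel|Parkhotel',
  'tourism/hotel|Park Hotel',
  'amenity/bank|Citibank',
  'amenity/cafe|Citi'
];

for (const group of findDuplicates(sample)) {
  console.log(`"${group[0]}" -> duplicates? -> "${group[1]}"`);
}
// prints: "tourism/hotel|Parkhotel" -> duplicates? -> "tourism/hotel|Park Hotel"
```

Because the tag is part of the bucket key, names only collide within the same tag: "Citibank" and "Citi" stem to the same string but are not reported, since one is under amenity/bank and the other under amenity/cafe.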
My goal is to get this repo into a state where we can ask for lots of volunteers ...
Thanks for the info.
List of (maybe) duplicated records.
(I have been working on a similar QA report for my pet project, dockerized-taginfo, so I just re-implemented the algorithm.)
(For audit) this is my script