[canonical.json] maybe duplicated names

ImreSamu commented 6 years ago

List of (maybe) duplicated records.

(I have been working on a similar QA report for my pet project, dockerized-taginfo, so I just re-implemented the algorithm.)

type1	name1 (count)	type2	name2 (count)
amenity/bank	ABANCA (102)	amenity/bank	Abanca (74)
amenity/bank	Banco BCI (77)	amenity/bank	BCI (154)
amenity/bank	Banco de Venezuela (98)	amenity/bank	De Venezuela (85)
amenity/bank	Banco Estado (160)	amenity/bank	BancoEstado (134)
amenity/bank	Banco Santander (180)	amenity/bank	Santander (3544)
amenity/bank	Bank BRI (223)	amenity/bank	BRI (335)
amenity/bank	Bankinter (167)	amenity/bank	Interbank (156)
amenity/bank	BCA (219)	amenity/bank	Bank BCA (60)
amenity/bank	BMCE (63)	amenity/bank	BMCE Bank (250)
amenity/bank	BMCI Bank (59)	amenity/bank	BMCI (125)
amenity/bank	BNI (226)	amenity/bank	Bank BNI (67)
amenity/bank	Caixa (296)	amenity/bank	CaixaBank (445)
amenity/bank	HBL (73)	amenity/bank	HBL Bank (123)
amenity/bank	Ibercaja (102)	amenity/bank	IberCaja (229)
amenity/bank	Intesa San Paolo (224)	amenity/bank	Intesa SanPaolo (75)
amenity/bank	Lloyds Bank (445)	amenity/bank	Lloyds (192)
amenity/bank	MCB (127)	amenity/bank	MCB Bank (54)
amenity/bank	PNC Bank (758)	amenity/bank	PNC (84)
amenity/bank	Popular (122)	amenity/bank	Banco Popular (686)
amenity/bank	Postbank (588)	amenity/bank	Bancpost (80)
amenity/bank	Sabadell (122)	amenity/bank	Banc Sabadell (209)
amenity/bank	Sabadell (122)	amenity/bank	Banco Sabadell (213)
amenity/bank	Standard Chartered Bank (102)	amenity/bank	Standard Chartered (102)
amenity/bank	UBL Bank (52)	amenity/bank	UBL (56)
amenity/bank	ПриватБанк (1060)	amenity/bank	Приватбанк (98)
amenity/fuel	SuperAmerica (57)	amenity/fuel	Super America (51)
shop/car_parts	NAPA Auto Parts (398)	shop/car_parts	Napa Auto Parts (77)
shop/car	Nissan (473)	shop/car	NISSAN (83)
shop/carpet	Carpet Right (120)	shop/carpet	Carpetright (68)
shop/clothes	engbers (52)	shop/clothes	Engbers (78)
shop/clothes	New Yorker (396)	shop/clothes	NewYorker (67)
shop/clothes	Pep (143)	shop/clothes	PEP (51)
shop/convenience	Abc (67)	shop/convenience	abc (427)
shop/convenience	Abc (67)	shop/convenience	ABC (798)
shop/convenience	Alimentation générale (74)	shop/convenience	Alimentation Générale (157)
shop/convenience	AMPM (145)	shop/convenience	ampm (180)
shop/convenience	best-one (51)	shop/convenience	Best-One (67)
shop/convenience	COOP Jednota (409)	shop/convenience	Coop Jednota (133)
shop/convenience	Magazin alimentar (92)	shop/convenience	Magazin Alimentar (80)
shop/convenience	Magazin mixt (94)	shop/convenience	Magazin Mixt (150)
shop/convenience	odido (85)	shop/convenience	Odido (204)
shop/convenience	On The Run (56)	shop/convenience	On the Run (101)
shop/convenience	Tesco Lotus Express (123)	shop/convenience	TESCO Lotus Express (63)
shop/convenience	Родны Кут (62)	shop/convenience	Родны кут (125)
shop/cosmetics	Магнит Косметик (157)	shop/cosmetics	Магнит косметик (86)
shop/doityourself	GAMMA (84)	shop/doityourself	Gamma (65)
shop/hearing_aids	Amplifon (180)	shop/hearing_aids	amplifon (56)
shop/kiosk	K Kiosk (64)	shop/kiosk	k kiosk (60)
shop/mobile_phone	mobilcom debitel (80)	shop/mobile_phone	Mobilcom Debitel (52)
shop/mobile_phone	Tim (68)	shop/mobile_phone	TIM (97)
shop/pet	Pets At Home (52)	shop/pet	Pets at Home (204)
shop/shoes	Payless Shoesource (55)	shop/shoes	Payless Shoe Source (328)
shop/shoes	Payless Shoesource (55)	shop/shoes	Payless ShoeSource (201)
shop/supermarket	BIM (62)	shop/supermarket	Bim (924)
shop/supermarket	Conad (618)	shop/supermarket	CONAD (81)
shop/supermarket	COOP Jednota (198)	shop/supermarket	Coop Jednota (109)
shop/supermarket	CRAI (76)	shop/supermarket	Crai (135)
shop/supermarket	EuroSpin (152)	shop/supermarket	Eurospin (369)
shop/supermarket	Norma (1178)	shop/supermarket	NORMA (149)
shop/supermarket	Rema 1000 (474)	shop/supermarket	REMA 1000 (62)
shop/supermarket	Shoprite (349)	shop/supermarket	ShopRite (83)
shop/supermarket	хүнсний дэлгүүр (73)	shop/supermarket	Хүнсний дэлгүүр (61)
tourism/hotel	Parkhotel (65)	tourism/hotel	Park Hotel (99)

(For audit) this is my script:

# Julia 1.0
using JSON

bankpattern = r"^amenity/bank\|"
canonical = JSON.parsefile("canonical.json")
dcnames = Dict{String,String}()   # normalized key => first original key seen

for (k, v) in canonical
    # Keys look like "amenity/bank|Banco Santander"
    parts = split(k, '|')
    ctype = String(parts[1])
    # Normalize the name: lowercase, trim, then drop all spaces
    cname = replace(strip(lowercase(parts[2])), " " => "")
    if occursin(bankpattern, k)
        # For banks, also drop the generic "bank" words
        for word in ("bank", "banca", "banco", "banc", "банк")
            cname = replace(cname, word => "")
        end
    end
    newkey = string(ctype, "|", cname)
    if haskey(dcnames, newkey)
        # Same normalized key seen before: report both records with their counts
        println(dcnames[newkey], "   (", canonical[dcnames[newkey]]["count"], ")",
                " | ", k, "   (", v["count"], ")",
                " |")
    else
        dcnames[newkey] = k
    end
end

bhousel commented 6 years ago

Cool thanks @ImreSamu! The filterNames script actually produces a list of duplicates for me already, but I might work the special "bank" processing into it.

ImreSamu commented 6 years ago

The filterNames script actually produces a list of duplicates for me already

Ouch, I had not checked, sorry.

but I might work the special "bank" processing into it.

OK.

I am also removing all spaces, so here is another test case for you: tourism/hotel | "Parkhotel" vs. "Park Hotel" (as far as I can see, it is not in your list).
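To illustrate why removing spaces catches this pair, here is a minimal sketch. The `normalize` helper is hypothetical and is not part of either script in this thread; it only mirrors the lowercase-and-strip-spaces step described above.

```javascript
// Hypothetical helper (an assumption for illustration): lowercase the name
// part of a "type|name" key and drop all whitespace from it.
function normalize(key) {
    const [type, name] = key.split('|');
    return type + '|' + name.toLowerCase().replace(/\s/g, '');
}

const a = normalize('tourism/hotel|Parkhotel');
const b = normalize('tourism/hotel|Park Hotel');

// Both keys collapse to "tourism/hotel|parkhotel", so they are flagged
// as possible duplicates.
console.log(a === b);
```

Without the space removal, the two keys would stay distinct after lowercasing and the pair would be missed.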

bhousel commented 6 years ago

Ouch, I had not checked, sorry.

No apology needed! I'm happy people are interested in this project 😄 I just need to update the README and stuff, and well, it's only Monday.

My goal is to get this repo into a state where we can ask for lots of volunteers to do the deduplication and look up the missing brand:wikidata and brand:wikipedia tags. Hacktoberfest starts in 2 weeks, and that would be a great task for first-time open source contributors.

bhousel commented 6 years ago

Duplicate catcher currently looks like this - thanks @ImreSamu for the suggestions!

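// Removes noise from the name so that we can compare
// similar names for catching duplicates.
// Assumes `diacritics` is the npm "diacritics" package, required elsewhere in the script.
function stemmer(name) {
    // The `g` flag makes each pattern strip every occurrence, not just the first
    // (without it, only the first space in "Payless Shoe Source" would be removed).
    var noise = [
        /ban(k|c)(a|o)?/ig,
        /банк/ig,
        /coop/ig,
        /express/ig,
        /(gas|fuel)/ig,
        /\s/g
    ];

    name = noise.reduce((acc, regex) => acc.replace(regex, ''), name);
    return diacritics.remove(name.toLowerCase());
}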
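
A stemmer like this is typically used by grouping records on the stemmed key and reporting collisions. The sketch below is only an illustration under stated assumptions: `simpleStem` is a reduced, dependency-free stand-in (it uses Unicode NFD normalization instead of the `diacritics` package), and `findDuplicates` is a hypothetical helper, not the actual filterNames code.

```javascript
// Stand-in stemmer (an assumption): strips bank words and whitespace,
// lowercases, then removes combining accents via NFD normalization.
function simpleStem(name) {
    return name
        .replace(/ban(k|c)(a|o)?/ig, '')
        .replace(/\s/g, '')
        .toLowerCase()
        .normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical helper: return pairs of "type|name" keys whose stems collide
// within the same type.
function findDuplicates(keys, stem) {
    const seen = new Map();   // "type|stem" -> first key seen with that stem
    const pairs = [];
    for (const key of keys) {
        const [type, name] = key.split('|');
        const stemKey = type + '|' + stem(name);
        if (seen.has(stemKey)) {
            pairs.push([seen.get(stemKey), key]);
        } else {
            seen.set(stemKey, key);
        }
    }
    return pairs;
}

// Sample keys taken from the list at the top of this issue.
const sample = [
    'amenity/bank|Banco Santander',
    'amenity/bank|Santander',
    'shop/convenience|Alimentation générale',
    'shop/convenience|Alimentation Générale',
    'tourism/hotel|Parkhotel',
    'shop/supermarket|Norma'
];

// Reports two pairs: the Santander entries and the Alimentation entries.
const pairs = findDuplicates(sample, simpleStem);
console.log(pairs);
```

Keeping the type in the key means identical names under different tags (for example shop/convenience and shop/supermarket) are not reported against each other.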

ImreSamu commented 6 years ago

Thanks!

The list is bigger now, and my Parkhotel example is in it: :+1: "tourism/hotel|Parkhotel" -> duplicates? -> "tourism/hotel|Park Hotel"

My goal is to get this repo into a state where we can ask for lots of volunteers ...

Thanks for the info.

osmlab / name-suggestion-index

[canonical.json] maybe duplicated names #150