osmlab / name-suggestion-index

Canonical common brand names, operators, transit and flags for OpenStreetMap.
https://nsi.guide
BSD 3-Clause "New" or "Revised" License

[canonical.json] maybe duplicated names #150

Closed ImreSamu closed 6 years ago

ImreSamu commented 6 years ago

List of (maybe) duplicated records.

(I have been working on a similar QA report for my pet project, dockerized-taginfo, so I just re-implemented the algorithm.)

type1 name1 (count) type2 name2 (count)
amenity/bank ABANCA (102) amenity/bank Abanca (74)
amenity/bank Banco BCI (77) amenity/bank BCI (154)
amenity/bank Banco de Venezuela (98) amenity/bank De Venezuela (85)
amenity/bank Banco Estado (160) amenity/bank BancoEstado (134)
amenity/bank Banco Santander (180) amenity/bank Santander (3544)
amenity/bank Bank BRI (223) amenity/bank BRI (335)
amenity/bank Bankinter (167) amenity/bank Interbank (156)
amenity/bank BCA (219) amenity/bank Bank BCA (60)
amenity/bank BMCE (63) amenity/bank BMCE Bank (250)
amenity/bank BMCI Bank (59) amenity/bank BMCI (125)
amenity/bank BNI (226) amenity/bank Bank BNI (67)
amenity/bank Caixa (296) amenity/bank CaixaBank (445)
amenity/bank HBL (73) amenity/bank HBL Bank (123)
amenity/bank Ibercaja (102) amenity/bank IberCaja (229)
amenity/bank Intesa San Paolo (224) amenity/bank Intesa SanPaolo (75)
amenity/bank Lloyds Bank (445) amenity/bank Lloyds (192)
amenity/bank MCB (127) amenity/bank MCB Bank (54)
amenity/bank PNC Bank (758) amenity/bank PNC (84)
amenity/bank Popular (122) amenity/bank Banco Popular (686)
amenity/bank Postbank (588) amenity/bank Bancpost (80)
amenity/bank Sabadell (122) amenity/bank Banc Sabadell (209)
amenity/bank Sabadell (122) amenity/bank Banco Sabadell (213)
amenity/bank Standard Chartered Bank (102) amenity/bank Standard Chartered (102)
amenity/bank UBL Bank (52) amenity/bank UBL (56)
amenity/bank ПриватБанк (1060) amenity/bank Приватбанк (98)
amenity/fuel SuperAmerica (57) amenity/fuel Super America (51)
shop/car_parts NAPA Auto Parts (398) shop/car_parts Napa Auto Parts (77)
shop/car Nissan (473) shop/car NISSAN (83)
shop/carpet Carpet Right (120) shop/carpet Carpetright (68)
shop/clothes engbers (52) shop/clothes Engbers (78)
shop/clothes New Yorker (396) shop/clothes NewYorker (67)
shop/clothes Pep (143) shop/clothes PEP (51)
shop/convenience Abc (67) shop/convenience abc (427)
shop/convenience Abc (67) shop/convenience ABC (798)
shop/convenience Alimentation générale (74) shop/convenience Alimentation Générale (157)
shop/convenience AMPM (145) shop/convenience ampm (180)
shop/convenience best-one (51) shop/convenience Best-One (67)
shop/convenience COOP Jednota (409) shop/convenience Coop Jednota (133)
shop/convenience Magazin alimentar (92) shop/convenience Magazin Alimentar (80)
shop/convenience Magazin mixt (94) shop/convenience Magazin Mixt (150)
shop/convenience odido (85) shop/convenience Odido (204)
shop/convenience On The Run (56) shop/convenience On the Run (101)
shop/convenience Tesco Lotus Express (123) shop/convenience TESCO Lotus Express (63)
shop/convenience Родны Кут (62) shop/convenience Родны кут (125)
shop/cosmetics Магнит Косметик (157) shop/cosmetics Магнит косметик (86)
shop/doityourself GAMMA (84) shop/doityourself Gamma (65)
shop/hearing_aids Amplifon (180) shop/hearing_aids amplifon (56)
shop/kiosk K Kiosk (64) shop/kiosk k kiosk (60)
shop/mobile_phone mobilcom debitel (80) shop/mobile_phone Mobilcom Debitel (52)
shop/mobile_phone Tim (68) shop/mobile_phone TIM (97)
shop/pet Pets At Home (52) shop/pet Pets at Home (204)
shop/shoes Payless Shoesource (55) shop/shoes Payless Shoe Source (328)
shop/shoes Payless Shoesource (55) shop/shoes Payless ShoeSource (201)
shop/supermarket BIM (62) shop/supermarket Bim (924)
shop/supermarket Conad (618) shop/supermarket CONAD (81)
shop/supermarket COOP Jednota (198) shop/supermarket Coop Jednota (109)
shop/supermarket CRAI (76) shop/supermarket Crai (135)
shop/supermarket EuroSpin (152) shop/supermarket Eurospin (369)
shop/supermarket Norma (1178) shop/supermarket NORMA (149)
shop/supermarket Rema 1000 (474) shop/supermarket REMA 1000 (62)
shop/supermarket Shoprite (349) shop/supermarket ShopRite (83)
shop/supermarket хүнсний дэлгүүр (73) shop/supermarket Хүнсний дэлгүүр (61)
tourism/hotel Parkhotel (65) tourism/hotel Park Hotel (99)

(For audit, this is my script.)

# Julia 1.0
using JSON

# Keys in canonical.json have the form "type|name", e.g. "amenity/bank|Santander".
bankpattern = r"^amenity/bank\|"
canonical = JSON.parsefile("canonical.json")

# Normalized key => first original key that produced it.
dcnames = Dict{String,String}()

for (k, v) in canonical
    parts = split(k, '|')
    ctype = String(parts[1])
    # Normalize the name: lowercase, trim, and drop all spaces.
    cname = replace(strip(lowercase(parts[2])), " " => "")
    if occursin(bankpattern, k)
        # For banks, also drop the generic word "bank" in several languages.
        for word in ("bank", "banca", "banco", "banc", "банк")
            cname = replace(cname, word => "")
        end
    end
    newkey = string(ctype, "|", cname)
    if haskey(dcnames, newkey)
        # Same normalized key seen before: report both originals with counts.
        println(dcnames[newkey], "   (", canonical[dcnames[newkey]]["count"], ")",
                " | ", k, "   (", v["count"], ")",
                " |")
    else
        dcnames[newkey] = k
    end
end
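
For readers who do not use Julia, the same normalization can be sketched in JavaScript. This is an illustration only: `normalize` and `findDuplicates` are hypothetical names, and the sample object copies a few names and counts from the list above instead of reading the real canonical.json. The bank words are removed in a single regex pass rather than five sequential replaces, which gives the same result for the sample names used here.

```javascript
// Hypothetical JavaScript sketch of the Julia script above; not part of the repo.
// Sample data: a few keys and counts copied from the list in this comment.
const sample = {
    'amenity/bank|Banco Santander': { count: 180 },
    'amenity/bank|Santander': { count: 3544 },
    'amenity/bank|Bankinter': { count: 167 },
    'amenity/bank|Interbank': { count: 156 },
    'tourism/hotel|Parkhotel': { count: 65 },
    'tourism/hotel|Park Hotel': { count: 99 },
    'shop/supermarket|Norma': { count: 1178 }
};

// Lowercase, drop spaces, and for banks drop the generic "bank" words.
function normalize(key) {
    const [type, rawName] = key.split('|');
    let name = rawName.trim().toLowerCase().replace(/ /g, '');
    if (type === 'amenity/bank') {
        name = name.replace(/bank|banca|banco|banc|банк/g, '');
    }
    return type + '|' + name;
}

// Return [firstKey, laterKey] for every key whose normalized form was already seen.
function findDuplicates(canonical) {
    const seen = new Map();   // normalized key -> first original key
    const pairs = [];
    for (const key of Object.keys(canonical)) {
        const norm = normalize(key);
        if (seen.has(norm)) {
            pairs.push([seen.get(norm), key]);
        } else {
            seen.set(norm, key);
        }
    }
    return pairs;
}

const pairs = findDuplicates(sample);
for (const [a, b] of pairs) {
    console.log(`${a} (${sample[a].count}) | ${b} (${sample[b].count})`);
}
```

Stripping "bank" anywhere in the name also makes pairs such as Bankinter / Interbank collide, even though they may well be unrelated. That is why the list is labeled "(maybe) duplicated": each pair is a candidate for a human check, not a confirmed duplicate.
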

bhousel commented 6 years ago

Cool thanks @ImreSamu! The filterNames script actually produces a list of duplicates for me already, but I might work the special "bank" processing into it.

ImreSamu commented 6 years ago

> The filterNames script actually produces a list of duplicates for me already

Ouch, I have not checked, sorry.

> but I might work the special "bank" processing into it.

OK.

I am also removing all spaces, so here is another test case for you: tourism/hotel | "Parkhotel" vs. "Park Hotel" (as far as I can see, it is not in your list).

bhousel commented 6 years ago

> Ouch, I have not checked, sorry.

No apology needed! I'm happy people are interested in this project 😄 I just need to update the README and stuff, and well, it's only Monday.

My goal is to get this repo into a state where we can ask for lots of volunteers to do the deduplication and look up the missing brand:wikidata and brand:wikipedia tags. Hacktoberfest starts in 2 weeks, and that would be a great task for first-time open source contributors.

bhousel commented 6 years ago

Duplicate catcher currently looks like this - thanks @ImreSamu for the suggestions!

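// Assumed to be the `diacritics` npm package, which exports remove().
const diacritics = require('diacritics');

// Removes noise from the name so that we can compare
// similar names for catching duplicates.
function stemmer(name) {
    // Generic words and whitespace that do not help tell brands apart.
    // The `g` flag makes replace() strip every occurrence, not just the first.
    var noise = [
        /ban(k|c)(a|o)?/ig,
        /банк/ig,
        /coop/ig,
        /express/ig,
        /(gas|fuel)/ig,
        /\s/g
    ];

    name = noise.reduce((acc, regex) => acc.replace(regex, ''), name);
    return diacritics.remove(name.toLowerCase());
}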
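
One detail that may be worth checking in a stemmer like this: `String.prototype.replace` with a regex that has no `g` flag replaces only the first match, so a name containing several spaces keeps all but one of them. The sketch below compares the two behaviours on a pair of names from the list earlier in the thread. `stripFirst` and `stripAll` are hypothetical helpers, and the diacritics step is left out to keep the example free of dependencies.

```javascript
// Illustration only; these helpers are not from the repo.
const bankWord = /ban(k|c)(a|o)?/i;

// Non-global whitespace regex: only the first whitespace character is removed.
function stripFirst(name) {
    return name.replace(bankWord, '').replace(/\s/, '').toLowerCase();
}

// Global whitespace regex: every whitespace character is removed.
function stripAll(name) {
    return name.replace(bankWord, '').replace(/\s/g, '').toLowerCase();
}

const a = 'Banco de Venezuela';
const b = 'De Venezuela';

console.log(stripFirst(a), '|', stripFirst(b));
console.log(stripAll(a), '|', stripAll(b));
```

With the non-global regex the two names reduce to different strings, so the pair would be missed; with the global regex both reduce to the same string.
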

ImreSamu commented 6 years ago

Thanks!

The list is bigger now, and my Parkhotel example is in it 👍: "tourism/hotel|Parkhotel" -> duplicates? -> "tourism/hotel|Park Hotel"

> My goal is to get this repo into a state where we can ask for lots of volunteers ...

Thanks for the info.