osmlab / name-suggestion-index

Canonical common brand names, operators, transit and flags for OpenStreetMap.
https://nsi.guide
BSD 3-Clause "New" or "Revised" License
713 stars 870 forks source link

Improve `match` #2606

Closed bhousel closed 5 years ago

bhousel commented 5 years ago

Possibly mentioned this on some other tickets but can't find it now.

People struggle to understand how the match property works. One way of thinking about match is : "the entries in this index are the preferred tagging and the matches are for detecting less-preferred alternatives"

It's not great that the match property just does direct string matching, so we end up having a lot of match properties for all the possible misspellings of a brand.

    "match": [
      "amenity/fast_food|Mc Donald's",
      "amenity/fast_food|Mc Donalds",
      "amenity/fast_food|McDonalds",
      "amenity/fast_food|McDonald’s",
      "amenity/restaurant|Mc Donald's",
      "amenity/restaurant|Mc Donalds",
      "amenity/restaurant|McDonald's",
      "amenity/restaurant|McDonalds",
      "amenity/restaurant|McDonald’s"
    ],

I'm thinking of splitting this up into matchTags and matchNames and having them accept regular expressions, which will run case insensitive.

    "matchTags": [
      "/amenity\/(fast_food|restaurant)/"
    ],
    "matchNames":  [
      "/^Ma?c\s?Donald(’|')?s$/"
    ]

I'm also wondering if it makes sense to add some more generic/global match rules somewhere.. e.g. make it so that:

I also want to take whatever matching code we build and export it so we can use it in iD's validator too, for detecting mistaggings.

matkoniecz commented 5 years ago

Making checks case insensitive seems to be a good idea.

I'm also wondering if it makes sense to add some more generic/global match rules somewhere.. e.g. make it so that:

If explicit nomatch would still allow to override such automatic matches - it should be OK.

I'm thinking of splitting this up into matchTags and matchNames

It makes sense, DRY is always nice. Probably there are no cases where it would be a problem.

  "/^Ma?c\s?Donald(’|')?s$/"

I am not convinced that it is more readable. From list of names it is obvious that there are no false positives, it is not immediately clear from regexp. Also, at least in theory, nonprogrammers may contribute to this project (or at least notice suspicious data). With requiring regexp the barrier for doing this will be pushed a bit higher.

Evan in case of entries with many, many common typos I think that regexp is less readable than just list of names.


Pinging @simonpoole in case that this would impact Vespucci

matkoniecz commented 5 years ago

BTW, in case that structure of output will change it may make sense to release it as 3.0 version.

bhousel commented 5 years ago

I am not convinced that it is more readable. From list of names it is obvious that there are no false positives, it is not immediately clear from regexp. Also, at least in theory, nonprogrammers may contribute to this project (or at least notice suspicious data). With requiring regexp the barrier for doing this will be pushed a bit higher.

Evan in case of entries with many, many common typos I think that regexp is less readable.

Oh yeah - I've changed my mind about this. Regexp is great but introduces more complexity than we need. People can just supply a list of strings to match.

BTW, in case that structure of output will change it may make sense to release it as 3.0 version.

Yes I will..