osmlab / name-suggestion-index

Canonical common brand names, operators, transit and flags for OpenStreetMap.
https://nsi.guide
BSD 3-Clause "New" or "Revised" License

Check generic regexes #6076

Open arch0345 opened 2 years ago

arch0345 commented 2 years ago

Originally posted by @andrewpmk in https://github.com/osmlab/name-suggestion-index/pull/6073#issue-1104797862

If this interests you, I created a branch, https://github.com/andrewpmk/name-suggestion-index/tree/check-generic-regex (not part of this pull request), where I updated your build_index.js script to warn about regexes in the generic section that don't match the whole word (that is, don't start with ^ and end with $). It looks like you are already using safe-regex to check for regex denial of service.

I'm not that familiar with how this project is structured, but I use iD a lot and the Irish pub bug was annoying me, so I finally figured out how to fix it yesterday. I'm not sure whether we should fix the other regexes flagged by this script, because none of them is anchored at both ends; most anchor only the beginning or only the end. Here are the problem generic regexes I found:

"brands/amenity/dentist" -> regex not limited to whole word -> "^стоматолог"
"brands/amenity/hospital" -> regex not limited to whole word -> "^инфекционн(ая|ое) (больница|отделение)"
"brands/amenity/hospital" -> regex not limited to whole word -> "^кожно-?венерологический диспансер"
"brands/amenity/pharmacy" -> regex not limited to whole word -> "^аптека"
"brands/shop/convenience" -> regex not limited to whole word -> "^magazin\s?(alimentar|mixt|non-stop)?"
"brands/shop/convenience" -> regex not limited to whole word -> "^მარკეტი( (market))?"
"brands/shop/kiosk" -> regex not limited to whole word -> "^მარკეტი( (market))?"
"brands/shop/tyres" -> regex not limited to whole word -> "vulcanisateur"
"operators/amenity/hospital" -> regex not limited to whole word -> "^инфекционн(ая|ое) (больница|отделение)"
"operators/amenity/hospital" -> regex not limited to whole word -> "^кожно-?венерологический диспансер"
"operators/amenity/prison" -> regex not limited to whole word -> "监狱管理局$"
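
For readers who want to see roughly what such a check involves, here is a minimal sketch. It is not the code from the check-generic-regex branch: the function names and the `genericsByPath` input shape are assumptions made for illustration, and only the warning text mirrors the output quoted above.

```javascript
// Illustrative sketch only; not the code from the check-generic-regex branch.

// True when the pattern ends with an unescaped "$" anchor
// (an odd number of preceding backslashes means the "$" is a literal).
function hasEndAnchor(pattern) {
  if (!pattern.endsWith('$')) return false;
  let backslashes = 0;
  for (let i = pattern.length - 2; i >= 0 && pattern[i] === '\\'; i--) {
    backslashes++;
  }
  return backslashes % 2 === 0;
}

// True when the pattern source starts with "^" and ends with an unescaped "$".
function isFullyAnchored(pattern) {
  return pattern.startsWith('^') && hasEndAnchor(pattern);
}

// genericsByPath is a hypothetical shape: { 'brands/shop/tyres': ['vulcanisateur', ...] }
function findUnanchoredGenerics(genericsByPath) {
  const warnings = [];
  for (const [path, patterns] of Object.entries(genericsByPath)) {
    for (const pattern of patterns) {
      if (!isFullyAnchored(pattern)) {
        warnings.push(`"${path}" -> regex not limited to whole word -> "${pattern}"`);
      }
    }
  }
  return warnings;
}

// Example input built from a few of the patterns listed above.
const sampleGenerics = {
  'brands/amenity/pharmacy': ['^аптека'],
  'brands/shop/tyres': ['vulcanisateur'],
  'operators/amenity/prison': ['监狱管理局$']
};
findUnanchoredGenerics(sampleGenerics).forEach(w => console.warn(w));
```

This only inspects the first and last characters of the pattern source, so a top-level alternation such as `^a|b$` would pass even though each branch is anchored at one end only.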

bhousel commented 2 years ago

Nice! Yeah I agree we should generally include the beginning-of-text and end-of-text marks.

There may have been a few where I intentionally left them off, so it’s possible that changing this will cause a few generic collected words to sneak in, but we can fix these easily.

If we wanted to just make a rule that they all must have these, or even have the code add them automatically, I’d be ok with that.
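
As a rough idea of what adding the anchors automatically could look like, here is a minimal sketch. The `anchorPattern` helper is hypothetical and not part of the project; it assumes generic patterns are stored as regex source strings like those listed in the original post.

```javascript
// Illustrative sketch only; anchorPattern is a hypothetical helper, not project code.

// True when the pattern ends with an unescaped "$" anchor.
function hasEndAnchor(pattern) {
  if (!pattern.endsWith('$')) return false;
  let backslashes = 0;
  for (let i = pattern.length - 2; i >= 0 && pattern[i] === '\\'; i--) {
    backslashes++;
  }
  return backslashes % 2 === 0;
}

// Returns a pattern anchored at both ends. Patterns that already start with "^"
// and end with "$" are returned unchanged; otherwise any existing single anchor
// is stripped and the body is wrapped in a non-capturing group so that
// alternations inside it stay within the anchors.
function anchorPattern(pattern) {
  const startAnchored = pattern.startsWith('^');
  const endAnchored = hasEndAnchor(pattern);
  if (startAnchored && endAnchored) return pattern;

  let body = pattern;
  if (startAnchored) body = body.slice(1);
  if (endAnchored) body = body.slice(0, -1);
  return `^(?:${body})$`;
}

// Example: the pharmacy pattern from the list above, before and after anchoring.
const before = new RegExp('^аптека', 'i');
const after = new RegExp(anchorPattern('^аптека'), 'i');
console.log(before.test('аптека 36.6'), after.test('аптека 36.6'));
```

Anchoring changes what each pattern matches: a name that merely starts with a generic word no longer counts as generic, which is the "sneak in" effect mentioned above, so any patterns left unanchored on purpose would need a trailing `.*` or similar.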