Impossible to exclude "İnşaat Malları" as generic

osmlab / name-suggestion-index

Canonical common brand names, operators, transit and flags for OpenStreetMap.

https://nsi.guide

BSD 3-Clause "New" or "Revised" License


Impossible to exclude "İnşaat Malları" as generic #8261

Closed by serhii-muchychka 1 year ago

serhii-muchychka commented 1 year ago

I tried to do this: https://github.com/osmlab/name-suggestion-index/commit/f4e5e1d4001c96da39ad884f78563bff0d0fc87d

Steps:

Added "^İnşaat Malları$" to the generics in the appropriate file
Deleted the "İnşaat Malları" brand entry
Ran npm run build the first time: OK (the script lowercases the letters in the new generic string)
Ran npm run build a second time: the script re-added the brand entry

It seems that the problem is related to the letter "İ"
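A minimal sketch of why "İ" is special, assuming only standard JavaScript string behavior (the variable names and the codePoints helper are illustrative and not taken from the NSI build scripts). Locale-independent lowercasing turns the single character U+0130 into two code points, "i" followed by a combining dot above, so the lowercased generic no longer lines up character for character with the original name:

```javascript
// "İnşaat Malları", written with escapes so the exact code points are unambiguous.
const name = '\u0130n\u015Faat Mallar\u0131';
const lowered = name.toLowerCase();

// Illustrative helper: list the code points of a string as U+XXXX labels.
const codePoints = (s) =>
  [...s].map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(codePoints(name.slice(0, 1)));    // [ 'U+0130' ]
console.log(codePoints(lowered.slice(0, 2))); // [ 'U+0069', 'U+0307' ]
console.log(name.length, lowered.length);     // 14 15
```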

bhousel commented 1 year ago

Looks like another edge case like the one we had in #5017. It's 2 years later, so I'll play around and see whether there are newer, better ways to do this.

1ec5 commented 1 year ago

Long-term, we shouldn’t do any manual diacritic-folding to compare strings, even with the help of the libraries in https://github.com/osmlab/name-suggestion-index/issues/5017#issuecomment-817343126. Especially since NSI tends to compare whole strings, String.prototype.localeCompare and Intl.Collator are a lot more robust. However, the behavior depends on the language you pass in. I guess individual entries would need to be able to specify the language the name is in, since OSM doesn’t do that?
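A rough sketch of the whole-string comparison idea, using the standard Intl.Collator API with sensitivity set to 'base' so that differences in accents and case are ignored. The sameName helper and the idea of passing a per-entry locale are assumptions for illustration; NSI entries do not necessarily carry such a field:

```javascript
// Compare whole strings with a collator instead of folding them by hand.
// sensitivity: 'base' treats letters that differ only by accent or case
// as equal, according to the rules of the given locale.
const english = new Intl.Collator('en', { sensitivity: 'base' });

console.log(english.compare('Haagen Dazs', 'Häagen Dazs') === 0); // true
console.log(english.compare('Haagen Dazs', 'Hagen Dazs') === 0);  // false

// Hypothetical helper: the locale would come from the entry being compared.
function sameName(a, b, locale) {
  return new Intl.Collator(locale, { sensitivity: 'base' }).compare(a, b) === 0;
}

console.log(sameName('STARBUCKS', 'Starbucks', 'en')); // true
```

The result depends on the locale: in a language such as Swedish, where ä is a separate letter, the first comparison would be expected to report a difference, which is why each entry would need to say which language its name is in.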

bhousel commented 1 year ago

Ok I added some fixes that will keep the generic "İnşaat Malları" from sneaking back into the index.

This was tricky because: you'd think that case insensitive regex /i would catch both upper and lower case variants of this, but it doesn't.

Then, I tried to match both variants with an exclude regex like '^(İ|i̇)nşaat malları$', but toLowerCasing that regex in our file_tree writing code was changing the 'İ'.

So for now, our build scripts just avoid toLowerCasing any string with an 'İ' in it.
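The behavior described above can be reproduced with a short sketch. The helper name lowerCaseUnlessDottedI is hypothetical and this is not the actual NSI build code; it only illustrates the "skip lowercasing when the string contains İ" workaround and why lowercasing the exclude regex breaks it:

```javascript
const DOTTED_CAPITAL_I = '\u0130'; // İ

// Hypothetical helper, not the actual NSI build code:
// leave strings containing İ untouched, lowercase everything else.
function lowerCaseUnlessDottedI(str) {
  return str.includes(DOTTED_CAPITAL_I) ? str : str.toLowerCase();
}

// The exclude pattern from the comment above, with both variants spelled out:
// İ (U+0130) or i followed by a combining dot above (U+0307).
const pattern = '^(\u0130|i\u0307)n\u015Faat mallar\u0131$';

const kept = new RegExp(lowerCaseUnlessDottedI(pattern), 'i');
const lowered = new RegExp(pattern.toLowerCase(), 'i');

const upper = '\u0130n\u015Faat Mallar\u0131'; // "İnşaat Malları"
const lower = upper.toLowerCase();             // starts with "i" + U+0307

console.log(kept.test(upper), kept.test(lower));       // true true
console.log(lowered.test(upper), lowered.test(lower)); // false true
```

Once the pattern has been lowercased, both alternatives become "i" plus the combining dot, and the /i flag does not map İ onto either of them, so the capitalized name slips past the exclusion.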

bhousel commented 1 year ago

Long-term, we shouldn’t do any manual diacritic-folding to compare strings, even with the help of the libraries in https://github.com/osmlab/name-suggestion-index/issues/5017#issuecomment-817343126

@1ec5 Can you say more about what you mean by this? I kind of think we do need to continue to diacritic-fold the strings.
We mostly do this to catch typos in the OSM tags that we're matching.

I guess our basic use case is: if someone creates something in OSM with name=Haagen Dazs, Rapid can suggest the tag name=Häagen-Dazs instead. I can't think of a situation where two locally used names would differ only by a diacritic mark.
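The use case above could be sketched roughly as follows. This is an illustration of diacritic folding with standard Unicode normalization, not the matcher that NSI or Rapid actually ships; foldName, canonical, and suggest are hypothetical names:

```javascript
// Fold a name to a loose comparison key: decompose, drop combining marks,
// drop everything that is not a letter or digit, then lowercase.
function foldName(name) {
  return name
    .normalize('NFD')
    .replace(/\p{M}/gu, '')
    .replace(/[^\p{L}\p{N}]/gu, '')
    .toLowerCase();
}

// Hypothetical lookup table from folded key to canonical name.
const canonical = new Map([[foldName('Häagen-Dazs'), 'Häagen-Dazs']]);

// Return the canonical name for a tag value, or null when nothing matches.
function suggest(tagValue) {
  return canonical.get(foldName(tagValue)) || null;
}

console.log(foldName('Häagen-Dazs')); // haagendazs
console.log(suggest('Haagen Dazs'));  // Häagen-Dazs
console.log(suggest('Hagen Dazs'));   // null
```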

serhii-muchychka commented 1 year ago

It turns out that this is a known problem. There is a Wikipedia article about it, so I'm leaving the link here in case anyone is interested: https://en.wikipedia.org/wiki/Dotted_and_dotless_I_in_computing