Closed serhii-muchychka closed 1 year ago
Looks like another edge case like what we had in #5017. It's 2 years later, so I'll play around and see whether there are newer, better ways to do this.
Long-term, we shouldn’t do any manual diacritic-folding to compare strings, even with the help of the libraries in https://github.com/osmlab/name-suggestion-index/issues/5017#issuecomment-817343126. Especially since NSI tends to compare whole strings, String.prototype.localeCompare and Intl.Collator are a lot more robust. However, the behavior depends on the language you pass in. I guess individual entries would need to be able to specify the language that the name tag is in, since OSM doesn’t do that?
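To illustrate the point about the language mattering, here is a minimal sketch using Intl.Collator with sensitivity set to 'base'. The helper name sameBaseLetters is made up for this example and is not part of NSI. It assumes a JavaScript runtime with full ICU locale data (the default in current Node.js releases).

```javascript
// Hypothetical helper (not NSI code): do two strings compare as equal
// when only base letters are considered in the given language?
function sameBaseLetters(a, b, locale) {
  const collator = new Intl.Collator(locale, { sensitivity: 'base' });
  return collator.compare(a, b) === 0;
}

// German treats 'ä' as 'a' plus a diacritic, so the base letters match.
const inGerman = sameBaseLetters('a', 'ä', 'de');

// Swedish treats 'ä' as a separate letter, so they do not match.
const inSwedish = sameBaseLetters('a', 'ä', 'sv');

console.log({ inGerman, inSwedish });
```

The same pair of strings gives a different answer depending on the language tag, which is why each entry would need to say which language its name is in.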
Ok I added some fixes that will keep the generic "İnşaat Malları" from sneaking back into the index.
This was tricky because you'd think that a case-insensitive regex (/i) would catch both the upper- and lowercase variants of this, but it doesn't.
Then, I tried to match both variants with an exclude regex like '^(İ|i̇)nşaat malları$', but toLowerCasing that regex in our file_tree writing code was changing the 'İ'.
So for now, our build scripts can just avoid toLowerCasing a string with an 'İ' in it.
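Here is a rough sketch of what goes wrong, for anyone who wants to reproduce it in a console. The constant names and the safeLowerCase helper are hypothetical and only illustrate the workaround described above; they are not the actual build script code.

```javascript
// Hypothetical sketch, not the actual NSI build code.
const DOTTED_I = '\u0130';                // 'İ' LATIN CAPITAL LETTER I WITH DOT ABOVE
const LOWERED_I = 'i\u0307';              // 'i' + COMBINING DOT ABOVE
const REST = 'n\u015Faat mallar\u0131';   // "nşaat malları"
const NAME = `${DOTTED_I}n\u015Faat Mallar\u0131`;  // "İnşaat Malları"

// Lowercasing 'İ' produces two code points, not one.
const lowered = DOTTED_I.toLowerCase();

// The /i flag does not treat 'İ' and its lowercased form as the same letter.
const naive = new RegExp(`^${DOTTED_I}${REST}$`, 'i');

// Spelling out both variants in the exclude pattern works...
const excludeSource = `^(${DOTTED_I}|${LOWERED_I})${REST}$`;
const exclude = new RegExp(excludeSource, 'i');

// ...until the pattern source itself is lowercased, which turns the 'İ'
// alternative into a second copy of the lowercased form.
const brokenExclude = new RegExp(excludeSource.toLowerCase(), 'i');

// Workaround: leave any string containing 'İ' untouched.
function safeLowerCase(str) {
  return str.includes(DOTTED_I) ? str : str.toLowerCase();
}

console.log(naive.test(LOWERED_I + REST));   // /i alone misses the lowercased form
console.log(exclude.test(NAME), brokenExclude.test(NAME));
```

With the workaround, safeLowerCase(excludeSource) returns the pattern unchanged, so both alternatives survive a round trip through the file writer.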
Long-term, we shouldn’t do any manual diacritic-folding to compare strings, even with the help of the libraries in https://github.com/osmlab/name-suggestion-index/issues/5017#issuecomment-817343126
@1ec5 Can you say more about what you mean by this? I kind of think we do need to continue to diacritic-fold the strings?
We mostly do this to catch typos in the OSM tags that we're matching.
I guess our basic use case is: if someone creates something in OSM with name=Haagen Dazs, Rapid can suggest the tag name=Häagen-Dazs instead. I can't think of a situation where the two locally used names would differ only by a diacritic mark.
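As a rough illustration of that use case, here is a sketch of folding a name before matching. It uses the built-in String.prototype.normalize rather than the libraries mentioned earlier, and foldName and suggestName are made-up names for this example, so treat it as an assumption about the general approach rather than what NSI or Rapid actually does.

```javascript
// Hypothetical sketch of diacritic folding for typo-tolerant matching.
function foldName(name) {
  return name
    .normalize('NFD')                  // split letters from their combining marks
    .replace(/\p{M}/gu, '')            // drop the combining marks
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, '');   // drop spaces, hyphens, other punctuation
}

// Return the canonical name whose folded form equals the folded input, if any.
function suggestName(input, canonicalNames) {
  const key = foldName(input);
  return canonicalNames.find((name) => foldName(name) === key) || null;
}

const suggestion = suggestName('Haagen Dazs', ['Häagen-Dazs', 'Ben & Jerry\'s']);
console.log(suggestion);
```

Both the mistyped and the canonical name fold to the same key, so the canonical spelling can be offered as the suggestion.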
It turns out that this is a known problem. Wikipedia has an article about it, so I'll leave the link here in case anyone is interested: https://en.wikipedia.org/wiki/Dotted_and_dotless_I_in_computing
I tried to do this: https://github.com/osmlab/name-suggestion-index/commit/f4e5e1d4001c96da39ad884f78563bff0d0fc87d
Steps:
1. npm run build (1st time): OK. The script replaces the letters in the new generic string with lowercase ones.
2. npm run build (2nd time): the script re-adds the brand entry.
It seems that the problem relates to the letter "İ".