Closed missinglink closed 3 years ago
Since reading that article I've added a second commit which strips the Zero-width Joiner symbol too.
It seems that this symbol is actually used in some Arabic and Indic scripts, so I'm happy to merge this as-is now but we might need to revisit it in the future if it's shown to cause any issues with common natural languages.
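As a rough illustration of what stripping the Zero-width Joiner (U+200D) does, here is a minimal sketch. The names `ZERO_WIDTH_JOINER` and `stripZeroWidthJoiner` are hypothetical and are not taken from the Pelias codebase. The second example is a Devanagari sequence in which the joiner influences how the conjunct is rendered, which is the kind of case that might need revisiting.

```javascript
// Hypothetical helper for illustration only; not the actual Pelias code.
const ZERO_WIDTH_JOINER = /\u200D/g;

function stripZeroWidthJoiner(text) {
  return text.replace(ZERO_WIDTH_JOINER, '');
}

// Emoji ZWJ sequence: man + ZWJ + woman + ZWJ + girl.
// Each emoji is two UTF-16 code units, each joiner is one.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(family.length);                       // 8
console.log(stripZeroWidthJoiner(family).length); // 6

// Devanagari: KA + VIRAMA + ZWJ + SSA.
// Removing the joiner leaves KA + VIRAMA + SSA, which may render differently.
const devanagari = '\u0915\u094D\u200D\u0937';
console.log(stripZeroWidthJoiner(devanagari) === '\u0915\u094D\u0937'); // true
```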
As mentioned in https://github.com/pelias/api/issues/1535, the Variation Selectors Unicode block can cause errors when the preceding codepoint has been stripped but the variation selector remains.
https://apps.timwhitlock.info/unicode/inspect?s=%E2%9D%A4%EF%B8%8F
In the example above the dingbat symbol `U+2764` was removed but the selector `U+FE0F` remained. It's a non-printable character, but it still causes issues when JS determines the `.length` of the string.
I'm not concerned about excluding these modifiers because they are currently only used to modify symbols which aren't relevant for geocoding. We're also already doing `NFKC` normalization, which converts decomposed symbols to composed symbols where applicable; in this case there exists no single 'composed' symbol to convert this pair into.
resolves: https://github.com/pelias/api/issues/1535
closes: https://github.com/pelias/api/pull/1536
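The `.length` problem described above can be reproduced with a short sketch. It assumes the Variation Selectors block spans U+FE00 to U+FE0F; the names `VARIATION_SELECTORS` and `stripVariationSelectors` are hypothetical and only illustrate the idea, they are not the code in this PR.

```javascript
// Hypothetical helper for illustration only; not the actual Pelias code.
// Matches the Variation Selectors block (U+FE00 to U+FE0F).
const VARIATION_SELECTORS = /[\uFE00-\uFE0F]/g;

function stripVariationSelectors(text) {
  return text.replace(VARIATION_SELECTORS, '');
}

// Heavy black heart (U+2764) followed by variation selector-16 (U+FE0F).
const heart = '\u2764\uFE0F';

// NFKC has no single composed codepoint for this pair, so both survive.
const normalized = heart.normalize('NFKC');
console.log(normalized.length); // 2

// Simulate the dingbat being stripped while the selector remains.
const orphaned = normalized.replace(/\u2764/g, '');
console.log(orphaned.length); // 1, although nothing visible is left

// Stripping the selector as well leaves a truly empty string.
console.log(stripVariationSelectors(orphaned).length); // 0
```

An emptiness check based on `.length` would treat `orphaned` as non-empty input, which is the failure mode reported in the linked issue.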