pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License

unicode: remove "variation selector" and "zero-width joiner" symbols #1537

Closed missinglink closed 3 years ago

missinglink commented 3 years ago

As mentioned in https://github.com/pelias/api/issues/1535, codepoints in the Variation Selectors Unicode block can cause errors when the preceding codepoint has been stripped but the variation selector remains.

https://apps.timwhitlock.info/unicode/inspect?s=%E2%9D%A4%EF%B8%8F

[Screenshot 2021-06-21 at 13:49:04]

In the example above, the dingbat symbol U+2764 was removed but the selector U+FE0F remained. The selector is a non-printable character, but it still counts towards the .length of the string in JS, which causes issues.

String.fromCharCode(0xFE0F).length
1
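To make the failure mode concrete, here is a minimal sketch of how a lone selector can end up in the input. The dingbat-only filter is a hypothetical stand-in for whatever earlier step removed U+2764; it is not the actual Pelias sanitizer code.

```javascript
// U+2764 (heavy black heart) followed by U+FE0F (variation selector-16).
const heart = '\u2764\uFE0F';

// Hypothetical earlier filter: drops the Dingbats block (U+2700 to U+27BF) only.
const stripped = heart.replace(/[\u2700-\u27BF]/g, '');

console.log(heart.length);           // 2
console.log(stripped.length);        // 1, the invisible selector is still there
console.log(stripped.trim().length); // 1, trim() does not treat U+FE0F as whitespace
```

The string looks empty when printed, yet length checks and "is this input blank?" tests still see one character.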

I'm not concerned about excluding these modifiers because they are currently only used to modify symbols which aren't relevant for geocoding. We're also already doing NFKC normalization, which converts decomposed symbols to composed symbols where applicable, but in this case there is no single 'composed' symbol to convert this pair into.

They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, 'Phags-pa letters, and CJK unified ideographs corresponding to CJK compatibility ideographs.
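As a rough sketch of the approach described above (not the code in this PR), the following shows that NFKC leaves the selector in place and that a character class covering the Variation Selectors block (U+FE00 to U+FE0F) removes it. The names `VARIATION_SELECTORS` and `stripVariationSelectors` are made up for illustration, and the sketch assumes only that one block needs handling.

```javascript
// Assumption: only the Variation Selectors block (U+FE00 to U+FE0F) is targeted.
const VARIATION_SELECTORS = /[\uFE00-\uFE0F]/g;

// Hypothetical helper: normalize first, then drop any variation selectors.
function stripVariationSelectors(text) {
  return text.normalize('NFKC').replace(VARIATION_SELECTORS, '');
}

// NFKC alone does not help: there is no composed form for this pair.
console.log('\u2764\uFE0F'.normalize('NFKC').length); // 2

// Stripping the block removes the selector whether or not its base survived.
console.log(stripVariationSelectors('\uFE0F').length);        // 0
console.log(stripVariationSelectors('Berlin\uFE0F') === 'Berlin'); // true
```

Because the selector is removed unconditionally, the result no longer depends on whether an earlier step stripped the base symbol.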

resolves: https://github.com/pelias/api/issues/1535
closes: https://github.com/pelias/api/pull/1536

missinglink commented 3 years ago

interesting article https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/

[Screenshot 2021-06-21 at 14:10:09]

missinglink commented 3 years ago

Since reading that article I've added a second commit which strips the zero-width joiner symbol (U+200D) too.

It seems that this symbol is actually used in some Arabic and Indic scripts. I'm happy to merge this as-is for now, but we might need to revisit it in the future if it's shown to cause any issues with common natural languages.
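For illustration only, a minimal sketch of what stripping the joiner does to an emoji ZWJ sequence and to a Devanagari sequence that uses it. `stripZeroWidthJoiner` is a hypothetical name, not the function added in the commit, and the sample strings are my own.

```javascript
// Hypothetical helper: remove every zero-width joiner (U+200D).
function stripZeroWidthJoiner(text) {
  return text.replace(/\u200D/g, '');
}

// Emoji ZWJ sequence: WOMAN (U+1F469) + ZWJ + GIRL (U+1F467).
const family = '\u{1F469}\u200D\u{1F467}';
console.log(family.length);                       // 5 (two surrogate pairs plus the joiner)
console.log(stripZeroWidthJoiner(family).length); // 4

// Devanagari: KA + VIRAMA + ZWJ + SSA. Here the joiner carries a rendering hint,
// so removing it changes how the text is displayed.
const devanagari = '\u0915\u094D\u200D\u0937';
console.log(stripZeroWidthJoiner(devanagari) === '\u0915\u094D\u0937'); // true
```

The second case is the kind of input that would need checking if the removal is ever shown to affect matching for these scripts.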