beyackle2 opened 2 years ago
Most (all?) emoji are 2 characters rather than one, according to JavaScript.
'🌻'.length // 2
In a UTF-16 string, which is what JS uses internally, it does take two 16-bit code units to make a single code point from a higher plane. It can be even more than that: emoji like 👩🏾‍💻, which are formed from a base character, a skin-tone modifier, a zero-width joiner (ZWJ), and another emoji, are seven 16-bit code units wide (the base, the skin tone, and the following emoji count as 2 each, and the ZWJ adds 1), but they consist of 4 actual code points and display as a single unit. I'm arguing that the intuitive representation is what Power Fx should use: both 🌻 and 👩🏾‍💻 should be treated as 1 character each, both to match what a user would expect functions like Split or Left to do and to prevent characters from being improperly split.
A workaround is to split on a different character instead, such as a comma in between items:
Split("🦓,🦊,🐺,𐊀",",")
Returns:
Table({Value:"🦓"},{Value:"🦊"},{Value:"🐺"},{Value:"𐊀"})
Some functions improperly treat strings containing higher-plane Unicode characters (i.e., those with code points at U+10000 and above, including most emoji) as if each such character were two separate, unintelligible characters.
Split("🦓🦊🐺𐊀","")
→ Table({Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"})
(should be Table({Value:"🦓"},{Value:"🦊"},{Value:"🐺"},{Value:"𐊀"}))

Left("xyz🅰🅱🅲", 6)
→ "xyz🅰�"
(note the trailing character, which is apparently half of the 🅱 emoji; this should evaluate to the same string that was passed in)

Similarly,
Right("🅰🅱🅲def", 6)
→ "�🅲def"
and
Mid("🅰🅱🅲def", 2, 4)
→ "�🅱�"

I believe something is going awry in how Power Fx handles characters wider than 16 bits: strings aren't being kept in a consistent Unicode transformation format, which is leading to these errors.
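A minimal sketch of what code-point-aware versions of these functions could look like, written in JavaScript for illustration (hypothetical helper names, not the actual Power Fx implementation): convert the string to an array of code points first, then slice that array.

```javascript
// Hypothetical code-point-aware helpers (illustration only). Array.from
// iterates by code point, so surrogate pairs are never torn apart. Note
// these still count a ZWJ sequence like 👩🏾‍💻 as 4 "characters"; full
// grapheme-cluster semantics would need a segmenter (e.g. Intl.Segmenter).
const toCodePoints = (s) => Array.from(s);

const left = (s, n) => toCodePoints(s).slice(0, n).join('');

const right = (s, n) => {
  const cp = toCodePoints(s);
  return cp.slice(Math.max(0, cp.length - n)).join('');
};

// 1-based start index, matching Power Fx's Mid
const mid = (s, start, n) =>
  toCodePoints(s).slice(start - 1, start - 1 + n).join('');

console.log(left('xyz🅰🅱🅲', 6));   // 'xyz🅰🅱🅲'
console.log(right('🅰🅱🅲def', 6));  // '🅰🅱🅲def'
console.log(mid('🅰🅱🅲def', 2, 4)); // '🅱🅲de'
```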