microsoft / Power-Fx

Power Fx low-code programming language
MIT License
3.21k stars 327 forks

Incorrect results for string functions on args with higher-plane characters #719

Open beyackle2 opened 2 years ago

beyackle2 commented 2 years ago

Some functions are improperly treating strings that contain higher-plane Unicode characters (i.e., those with code points at U+10000 and higher, including most emoji) as if each of those characters were two characters long and unintelligible.

Split("🦓🦊🐺𐊀","") returns Table({Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"}) (should be Table({Value:"🦓"},{Value:"🦊"},{Value:"🐺"},{Value:"𐊀"}))

Left("xyz🅰🅱🅲", 6) returns "xyz🅰�" (note the final character, which is apparently half of the 🅱 emoji; this should just evaluate to the same string that was passed in)

Similarly, Right("🅰🅱🅲def", 6) returns "�🅲def", and Mid("🅰🅱🅲def", 2, 4) returns "�🅱�".

I believe something is going awry with how PowerFx handles characters wider than 16 bits: strings aren't being kept in a consistent transformation format, which leads to these errors.
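To make the expected results concrete, here is a minimal JavaScript sketch of Left, Right, and Mid that count code points instead of 16-bit units. The helper names (codePoints, leftByCodePoint, rightByCodePoint, midByCodePoint) are hypothetical and this is not how Power Fx is implemented; it only illustrates the behavior being asked for, assuming the underlying strings are UTF-16.

```javascript
// Hypothetical helpers for illustration only; not Power Fx's implementation.
// Array.from uses the string iterator, which yields whole code points,
// so a surrogate pair is never cut in half.
const codePoints = (s) => Array.from(s);

function leftByCodePoint(s, n) {
  return codePoints(s).slice(0, Math.max(n, 0)).join("");
}

function rightByCodePoint(s, n) {
  const cps = codePoints(s);
  // Clamp the start index so n = 0 yields "" and large n yields the whole string.
  return cps.slice(Math.max(cps.length - n, 0)).join("");
}

function midByCodePoint(s, start, count) {
  // start is 1-based, mirroring Mid in the examples above.
  return codePoints(s).slice(start - 1, start - 1 + count).join("");
}

const sample = "xyz\u{1F170}\u{1F171}\u{1F172}"; // "xyz🅰🅱🅲"

// Slicing by 16-bit units stops in the middle of 🅱.
console.log(sample.slice(0, 6) === sample);         // false
// Slicing by code point returns the full six-character string.
console.log(leftByCodePoint(sample, 6) === sample); // true
```

This still splits multi-code-point emoji such as ZWJ sequences; handling those would need grapheme-cluster segmentation rather than code-point iteration.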

marclundgren commented 2 years ago

Most (all?) emojis are 2 characters rather than one, according to JavaScript.

'🌻'.length // 2

beyackle2 commented 2 years ago

In a UTF-16 string, which is what JS uses internally, it does take two 16-bit "characters" to make a single code point from a higher plane. It can be even more than that; emoji like 👩🏾‍💻 which are formed from a base character, a skin-tone, a zero-width joiner, and another emoji, are seven whole 16-bit "characters" wide (the base, skin-tone, and following emoji count as 2 each, and the ZWJ is the additional 1), but they consist of 4 real code points and display as a single unit. I'm arguing that the intuitive representation here should be what PowerFx uses; both 🌻 and 👩🏾‍💻 should be treated as 1 character each, both to conform to what a user would expect functions like Split or Left to do and to prevent characters from being improperly split.
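The three ways of counting described above can be checked in JavaScript. This is a sketch, assuming a runtime that provides Intl.Segmenter (recent Node.js and browsers do); the function names are made up for illustration.

```javascript
// Three ways to measure the "length" of a string (names are illustrative).
const countCodeUnits = (s) => s.length;               // 16-bit UTF-16 units
const countCodePoints = (s) => Array.from(s).length;  // Unicode code points
const countGraphemes = (s) => {
  // User-perceived characters, per the runtime's grapheme segmentation rules.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(s)).length;
};

const sunflower = "\u{1F33B}";                            // 🌻
const technologist = "\u{1F469}\u{1F3FE}\u200D\u{1F4BB}"; // 👩🏾‍💻

console.log(countCodeUnits(sunflower), countCodePoints(sunflower), countGraphemes(sunflower));
// 2 1 1
console.log(countCodeUnits(technologist), countCodePoints(technologist), countGraphemes(technologist));
// 7 4 1
```

Counting by grapheme is the only one of the three that gives 1 for both strings, which matches the "one character each" behavior argued for here.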

MikeStall commented 2 years ago

A workaround is to split on a different character, such as a comma between items:

Split("🦓,🦊,🐺,𐊀",",")

Returns: Table({Value:"🦓"},{Value:"🦊"},{Value:"🐺"},{Value:"𐊀"})
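A plausible reason the workaround holds up, sketched in JavaScript on the assumption that the strings are UTF-16 underneath: a comma is a single 16-bit unit that can never be part of a surrogate pair, so splitting on it never cuts a pair in half, whereas splitting on the empty string cuts between every unit. This mirrors JavaScript's behavior and is not a statement about Power Fx internals.

```javascript
// "🦓,🦊,🐺,𐊀" written with escapes so the code points are unambiguous.
const withCommas = "\u{1F993},\u{1F98A},\u{1F43A},\u{10280}";
const withoutCommas = withCommas.replaceAll(",", "");

// Splitting on a delimiter keeps each surrogate pair intact.
const byComma = withCommas.split(",");
console.log(byComma.length);                        // 4
console.log(byComma.every((p) => p.length === 2));  // true: each item is one whole pair

// Splitting on "" separates every 16-bit unit, producing lone surrogates.
const byUnit = withoutCommas.split("");
console.log(byUnit.length);                         // 8
```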