rojo-rbx / rbx-dom

Roblox DOM and (de)serialization implementation in Rust
MIT License

TextLabels with emoji in them are not created properly with build #202

Open NobleDraconian opened 3 years ago

NobleDraconian commented 3 years ago

Currently, as of 6.0.0-rc.1, any TextLabel or other UI element whose text contains emoji is not created properly by the rojo build command. The resulting text is garbled and the emoji is missing.

Steps to reproduce

  1. Create a UI that has a text element in it. Insert an emoji character such as "😃". Save it to your project's repository as an rbxmx.
  2. Generate the datamodel via rojo build, then open it. Notice the garbled text: image
  3. Manually insert the rbxmx from Studio. Notice that it displays correctly.

LPGhatguy commented 3 years ago

Thanks for the easy repro case! This is really funky; Roblox Studio seems to be producing what looks like standard XML. I'm not sure if it's rbx_xml's interaction with xml-rs that would be broken or if this would be an xml-rs bug. I'll keep poking around and see.

foxfabi commented 3 years ago

I think it's not a rojo issue :) Usually I create a part in Roblox Studio and save it as a *.rbxmx file.

If I put a UTF-8 character into the Text value in Roblox Studio, the rbxmx file contains <string name="Text">&#226;&#153;&#175;</string>.

If I replace that escaped string in the *.rbxmx file with a literal UTF-8 character, like <string name="Text">♫</string>, and delete the part in Roblox Studio, it appears again after sync with the correct value ♫

LPGhatguy commented 3 years ago

We're seeing some other escaping issues with our XML library. I'm going to move this issue into rojo-rbx/rbx-dom, as this is where the problem lies. This also impacts Remodel.

Dekkonot commented 10 months ago

Alright, I have bad news. This is Roblox's fault, so swapping out our parser won't fix it and we'll have to handle it manually.

The issue is that those &#DDD; escapes are what the XML standard calls "character references", and they explicitly refer to character codepoints, not bytes. So our parser expands them accordingly.

You can actually verify this yourself: 😃 is the UTF-8 byte sequence 240 159 152 131 (in decimal). A quick check will show that's what Studio saves too: &#240;&#159;&#152;&#131;. If we were to treat each number in that sequence as a raw byte, it would decode to the correct emoji.

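However, that's not what the XML standard says to do, so what we're doing instead is expanding each of those as a codepoint and encoding it as UTF-8. So:

  - &#240; is U+00F0, which is c3 b0 in UTF-8
  - &#159; is U+009F, which is c2 9f in UTF-8
  - &#152; is U+0098, which is c2 98 in UTF-8
  - &#131; is U+0083, which is c2 83 in UTF-8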

So the total expansion is c3 b0 c2 9f c2 98 c2 83, which if you look at it in Studio is rather similar to the garbled text in the OP:

image
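
To make the difference concrete, here is a minimal Rust sketch of the two interpretations of Studio's output. The function names are made up for illustration and none of this is rbx_xml or xml-rs code; it only uses the standard library to show what "numbers as bytes" versus "numbers as codepoints" produces for 240 159 152 131.

```rust
/// Treat each number as a raw byte and decode the whole run as UTF-8.
/// This is the reading that Studio's output appears to need.
fn expand_as_bytes(refs: &[u32]) -> Option<String> {
    let bytes: Option<Vec<u8>> = refs
        .iter()
        .map(|&n| if n <= 0xFF { Some(n as u8) } else { None })
        .collect();
    String::from_utf8(bytes?).ok()
}

/// Treat each number as a Unicode codepoint, which is what the XML
/// standard specifies for character references.
fn expand_as_code_points(refs: &[u32]) -> Option<String> {
    refs.iter().map(|&n| char::from_u32(n)).collect()
}

/// Render the UTF-8 bytes of a string as space-separated hex.
fn hex(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // The numbers from &#240;&#159;&#152;&#131;
    let refs = [240, 159, 152, 131];

    let literal = expand_as_bytes(&refs).unwrap();
    let per_spec = expand_as_code_points(&refs).unwrap();

    println!("{}", hex(&literal));  // f0 9f 98 83 (the emoji)
    println!("{}", hex(&per_spec)); // c3 b0 c2 9f c2 98 c2 83 (garbled)
}
```

The second line of output matches the total expansion quoted above, which is why the text comes out garbled rather than simply missing.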

We can fix this by escaping characters ourselves, and we likely will. I thought I'd share the fact that this behavior is correct and Roblox is wrong though, since it's interesting.
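
For anyone who needs a stopgap, one possible workaround is to repair the text after the parser has expanded it. The sketch below is only a heuristic under stated assumptions, not the fix that will land in rbx-dom: `repair_byte_references` is a hypothetical helper, and it assumes that Studio wrote every non-ASCII byte as its own character reference, so an affected string contains only codepoints up to U+00FF.

```rust
/// Hypothetical helper, not part of rbx_xml. If every char in `decoded`
/// fits in one byte and those bytes form valid UTF-8, assume the string
/// came from byte-valued character references and return the re-decoded
/// text. Otherwise return the input unchanged.
fn repair_byte_references(decoded: &str) -> String {
    let mut bytes = Vec::with_capacity(decoded.len());
    for ch in decoded.chars() {
        let cp = ch as u32;
        if cp > 0xFF {
            // A real codepoint above U+00FF cannot be a stray byte.
            return decoded.to_owned();
        }
        bytes.push(cp as u8);
    }
    String::from_utf8(bytes).unwrap_or_else(|_| decoded.to_owned())
}

fn main() {
    // What a spec-compliant parser yields for "Hi &#240;&#159;&#152;&#131;"
    let garbled = "Hi \u{f0}\u{9f}\u{98}\u{83}";
    println!("{}", repair_byte_references(garbled));

    // Text that is not byte-escaped UTF-8 is left alone.
    assert_eq!(repair_byte_references("caf\u{e9}"), "caf\u{e9}");
    assert_eq!(repair_byte_references("♫"), "♫");
}
```

The trade-off is ambiguity: a string that legitimately contains Latin-1 range characters which happen to form valid UTF-8 when read as bytes would be rewritten too, so handling the references before they reach the parser is likely the more robust route.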