usethesource / rascal-language-servers

An LSP server for Rascal which includes an easy-to-use LSP generator for languages implemented in Rascal, and an interactive terminal REPL.
BSD 2-Clause "Simplified" License

Unicode characters are getting mangled in the REPL #326

Closed by DavyLandman 10 months ago

DavyLandman commented 10 months ago

Describe the bug

I've noticed that Unicode characters are getting mangled. It's visible when you use prettyTree, but also in other cases.

In the VS Code REPL, if you type this:

rascal>"\U01F4A9"
str: "?"
---
?
---
rascal>print("\U01F4A9")
💩ok
rascal>"\u251C"
str: "├"
---
├
---
rascal>println("\u251C")
Γö£
ok

We get 2 different renderings of the same string.

And if we do it in a separate terminal window (but with the same jar):

rascal>"\U01F4A9"
str: "💩"
---
💩
---
rascal>print("\U01F4A9")
💩ok
rascal>"\u251C"
str: "├"
---
├
---
rascal>println("\u251C")
├
ok

Strangely, for some smaller Unicode characters (those that fit in the lower plane, the Basic Multilingual Plane), printing the string result works fine, but passing that string to println fails.
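To make the difference between the two test characters concrete, here is a minimal Python sketch (not Rascal code, and not taken from this project) that reports each character's plane and encoded length. The helper name `describe` is hypothetical. Note that Python writes the escape with eight hex digits (`\U0001F4A9`), whereas the Rascal REPL input above uses `\U01F4A9`. Whether the plane difference actually explains the rendering difference is not established here; the sketch only shows how the two characters differ.

```python
def describe(ch: str) -> dict:
    """Summarize how a single character is represented (hypothetical helper)."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # 0 is the Basic Multilingual Plane
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-be")) // 2,  # 2 means a surrogate pair
    }


if __name__ == "__main__":
    for ch in ("\u251C", "\U0001F4A9"):
        print(describe(ch))
```

The box-drawing character sits in plane 0 and needs a single UTF-16 code unit, while the emoji sits in plane 1 and needs a surrogate pair.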

Expected behavior

Correct printing of Unicode characters.


Additional context

I've checked on Linux, and there it works correctly. I have no Mac to verify this on.

DavyLandman commented 10 months ago

I just went back to an older version of the Rascal extension, and even 0.5.0 (almost a year old) has this issue. So it's either an old bug in Rascal or a new bug in VS Code.

So I also went back to VS Code 1.72 (September 2022), and it was already broken back then. Just to be sure, I went back another year in both VS Code and Rascal, and it does not work there either.

So my conclusion is, we never had this working for Windows, and it's not some recent regression.

DavyLandman commented 10 months ago

It looks like emojis are a special case, as they often come from a secondary font, and VS Code has many reported issues around those. So the 💩 not printing might be related to that (for the normal string case). See for example this issue: https://github.com/microsoft/vscode/issues/32840 and this downstream issue: https://github.com/xtermjs/xterm.js/issues/2693

But the second case, where print produces something very different, looks like a Rascal bug that we should try to figure out.

DavyLandman commented 10 months ago

It looks like the print function is outputting UTF-8 bytes (├ is 3 UTF-8 bytes and 💩 is 4), but those bytes are not interpreted as UTF-8. Instead, each byte is rendered as a separate character in some other code page/encoding, which is why ├ shows up as the three characters Γö£. This behavior also differs from how result values are printed to the REPL, and from how the output is encoded on Linux.

After some experimentation, it seems that VS Code / xterm has decided the REPL uses the CP437 code page. So we have to fix that to get around this issue.
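The mechanism can be reproduced outside the REPL. The following is a minimal Python sketch (not Rascal code, and not the actual fix): it encodes a string as UTF-8 and then deliberately decodes the bytes as CP437, using Python's built-in `cp437` codec. The function name `misdecode` is hypothetical.

```python
def misdecode(text: str, wrong_encoding: str = "cp437") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    box = "\u251C"  # the box-drawing character from the report
    print(box.encode("utf-8").hex(" "))  # e2 94 9c
    print(misdecode(box))                # Γö£
```

The three UTF-8 bytes E2 94 9C map to Γ, ö and £ in CP437, which matches the println output seen in the VS Code REPL.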

jurgenvinju commented 10 months ago

I don't have this issue on my mac.

jurgenvinju commented 10 months ago

It's probably an effect of configuration in the context. What is your TERM variable saying?