modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[BUG] [stdlib] Unicode escape sequences are interpreted as utf8 and misrepresented. #2842

Open martinvuyk opened 3 months ago

martinvuyk commented 3 months ago

Bug description

print(ord("\x85")) # prints 5, but should print 133
print(ord("\x1c")) # prints 28, as expected
print(ord("\x1e")) # prints 30, as expected

- \x85 or ~~120 56 53 or 0x78 0x38 0x35~~ 0xC2 0x85 is the NEXT LINE (NEL) character
- \x1c or ~~92 120 49 99 or 0x5c 0x78 0x31 0x63~~ 0x1C is the INFORMATION SEPARATOR FOUR character
- \x1e or ~~92 120 49 101 or 0x5c 0x78 0x31 0x65~~ 0x1E is the INFORMATION SEPARATOR TWO character
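For comparison, here is a small Python sketch of the values listed above. It assumes that Python's handling of `\x` escapes in `str` literals (the escape names a code point, not a raw byte) is the behavior being expected from Mojo. `describe` is a hypothetical helper name used only for this illustration.

```python
# Reference values taken from CPython, used only as a point of comparison.
# `describe` is a hypothetical helper, not part of any library.
def describe(ch):
    """Return (code point, UTF-8 bytes) for a one-character string."""
    return ord(ch), ch.encode("utf-8")


for ch in ["\x85", "\x1c", "\x1e"]:
    code_point, utf8 = describe(ch)
    print(f"U+{code_point:04X} -> {code_point} -> {utf8.hex(' ')}")

# Output:
# U+0085 -> 133 -> c2 85
# U+001C -> 28 -> 1c
# U+001E -> 30 -> 1e
```

Only U+0085 lies above the ASCII range, so it is the only one of the three whose UTF-8 encoding needs two bytes.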

Steps to reproduce

System information

- What OS did you install Mojo on?
- Provide version information for Mojo by pasting the output of `mojo -v`
`mojo 2024.5.2605`
- Provide Modular CLI version by pasting the output of `modular -v`

martinvuyk commented 3 months ago

@JoeLoser I only saw your edit now that I came to fix the values. The page I was using for the Unicode to UTF-8 conversion was absolute garbage. I'm currently using:

- https://www.fileformat.info/info/unicode/char/85/index.htm
- https://www.fileformat.info/info/unicode/char/001c/index.htm
- https://www.fileformat.info/info/unicode/char/1e/index.htm

So the only real problem is the \x85 sequence.

I'm sorry if this caused any waste of effort on your part.

mzaks commented 2 months ago

I just had a look at this issue. The problem is that "\x85" converts to a byte array with a single 0x85 byte, which is an invalid UTF-8 byte stream. I think there are two things we need to do:

  1. Make the ord function more robust by asserting on invalid single byte sequences.
  2. Identify what is the expected byte array for "\x85" string literal. If the string literal depicts the Unicode code point "U+0085", then the valid UTF-8 byte string should be 0xc2 0x85.

I can create a PR for point one. Point two needs further discussion.
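A rough Python sketch of point one, under stated assumptions. `lenient_ord` is a hypothetical decoder that trusts its input and only masks off the leading 1 bits of the first byte. It is not the actual Mojo implementation, but it shows one plausible way a lone 0x85 byte ends up as 5. `strict_ord` is likewise hypothetical and rejects invalid lead bytes. It raises `ValueError` where a Mojo version might assert instead.

```python
def lenient_ord(data: bytes) -> int:
    """Hypothetical decoder that trusts its input (illustration only)."""
    first = data[0]
    if first < 0x80:
        return first
    # Count the leading 1 bits of the first byte.
    n = 0
    while n < 8 and first & (0x80 >> n):
        n += 1
    # Mask off the length marker, then append 6 bits per following byte.
    result = first & (0x7F >> n)
    for b in data[1:n]:
        result = (result << 6) | (b & 0x3F)
    return result


def strict_ord(data: bytes) -> int:
    """Hypothetical decoder that rejects invalid lead and continuation bytes.

    Simplified: it does not reject overlong 3/4-byte forms or surrogates.
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        return first
    if 0xC2 <= first <= 0xDF:
        n = 2
    elif 0xE0 <= first <= 0xEF:
        n = 3
    elif 0xF0 <= first <= 0xF4:
        n = 4
    else:
        # 0x80..0xBF are continuation bytes; 0xC0, 0xC1, 0xF5..0xFF never appear.
        raise ValueError(f"invalid UTF-8 lead byte 0x{first:02X}")
    if len(data) < n:
        raise ValueError("truncated UTF-8 sequence")
    result = first & (0x7F >> n)
    for b in data[1:n]:
        if b & 0xC0 != 0x80:
            raise ValueError(f"invalid UTF-8 continuation byte 0x{b:02X}")
        result = (result << 6) | (b & 0x3F)
    return result


print(lenient_ord(b"\x85"))     # 5
print(strict_ord(b"\xc2\x85"))  # 133
```

With the strict version, `strict_ord(b"\x85")` raises `ValueError` instead of silently returning 5.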

martinvuyk commented 2 months ago

> function more robust by asserting on invalid single byte sequences

Nice, that will help a lot

> Identify what is the expected byte array for "\x85" string literal.

What function handles the translation of an escape sequence to a number? chr() handles the numeric conversion, but if it is passed a wrong number it still converts it. The main problem I see is that \x is being interpreted as a literal hexadecimal byte value in the UTF-8 output and not as a Unicode code point. So whatever function does the translation from escape sequence to UTF-8 bytes has to be fixed to return 1-4 bytes of UTF-8 (bit-shifted inside an Int value).
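A minimal Python sketch of that translation, assuming the escape is meant to name a code point (still under discussion above). `utf8_encode` and `pack_utf8` are hypothetical names, and packing the bytes big-endian into one integer is just one possible reading of "bit-shifted inside an Int value".

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as 1-4 UTF-8 bytes (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def pack_utf8(cp: int) -> int:
    """Pack the UTF-8 bytes into one integer, first byte most significant.

    The byte order is an assumption made for this sketch.
    """
    return int.from_bytes(utf8_encode(cp), "big")


print(utf8_encode(0x85).hex(" "))  # c2 85
print(hex(pack_utf8(0x85)))        # 0xc285
print(hex(pack_utf8(0x1C)))        # 0x1c
```

Under this reading, "\x85" would produce the two bytes 0xC2 0x85, while "\x1c" and "\x1e" stay single-byte because they are in the ASCII range.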