Open martinvuyk opened 5 months ago
@JoeLoser saw your edit now that I came to fix the values. The page I was getting the Unicode-to-UTF-8 conversion from was absolute garbage. I'm currently using:
https://www.fileformat.info/info/unicode/char/85/index.htm
https://www.fileformat.info/info/unicode/char/001c/index.htm
https://www.fileformat.info/info/unicode/char/1e/index.htm
so the only real problem is the \x85 sequence.
I'm sorry if this caused any waste of effort on your part.
I just had a look at this issue. The problem is that "\x85" converts to a byte array with a single 0x85 byte, which is an invalid UTF-8 byte stream. I think there are 2 things we need to do:
1. Make the ord function more robust by asserting on invalid single byte sequences.
2. Identify what is the expected byte array for the "\x85" string literal. If the string literal depicts the Unicode code point "U+0085", then the valid UTF-8 byte string should be 0xc2 0x85.
I can create a PR for point one. Point two needs further discussion.
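A minimal sketch of both points, written in Python rather than Mojo and only meant as an illustration. `strict_ord` is a hypothetical helper name, not the Mojo `ord` and not a proposal for its actual implementation. It rejects a lone 0x85 byte, while the two-byte sequence 0xc2 0x85 decodes to U+0085.

```python
def strict_ord(data: bytes) -> int:
    """Return the code point of the first UTF-8 sequence in `data`.

    Hypothetical helper for illustration: raises ValueError on malformed
    input instead of silently returning a number.
    """
    if not data:
        raise ValueError("empty input")

    first = data[0]
    if first < 0x80:              # 0xxxxxxx: single-byte (ASCII) sequence
        length = 1
    elif 0xC2 <= first <= 0xDF:   # 110xxxxx: lead byte of a 2-byte sequence
        length = 2
    elif 0xE0 <= first <= 0xEF:   # 1110xxxx: lead byte of a 3-byte sequence
        length = 3
    elif 0xF0 <= first <= 0xF4:   # 11110xxx: lead byte of a 4-byte sequence
        length = 4
    else:
        # 0x80..0xBF are continuation bytes, 0xC0/0xC1 and 0xF5..0xFF never
        # appear in valid UTF-8, so none of them can start a sequence.
        raise ValueError(f"invalid UTF-8 lead byte: {first:#04x}")

    try:
        # Let Python's strict decoder validate the continuation bytes.
        return ord(data[:length].decode("utf-8"))
    except UnicodeDecodeError as err:
        raise ValueError(f"malformed UTF-8 sequence: {err}") from err


# U+0085 encoded as UTF-8 is two bytes, not one.
nel_bytes = "\u0085".encode("utf-8")
```

Under this sketch, `strict_ord(b"\xc2\x85")` yields 0x85, while `strict_ord(b"\x85")` raises because 0x85 on its own is a continuation byte.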
> Make the ord function more robust by asserting on invalid single byte sequences

Nice, that will help a lot.

> Identify what is the expected byte array for "\x85" string literal.

What function handles the translation of an escape sequence to a number? Because chr() handles the numeric conversion, but if it is passed a wrong number it still converts it.
The main problem I see is that \x is being interpreted as a literal hexadecimal byte value in the UTF-8 output and not as a Unicode code point. So whatever function does the translation from escape sequence to UTF-8 bytes has to be fixed to return 1-4 bytes of UTF-8 (bit-shifted inside an Int value).
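As a rough sketch of what such a translation could look like (in Python, not Mojo): the function below takes the code point named by the escape and returns its UTF-8 bytes packed into a single integer. The name `utf8_packed`, the `(packed, length)` return shape, and the choice to put the first byte in the most significant position are all assumptions made for illustration; the thread does not specify any of them.

```python
from typing import Tuple


def utf8_packed(code_point: int) -> Tuple[int, int]:
    """Encode `code_point` as UTF-8 and pack the 1-4 bytes into one int.

    Returns (packed, length). The first UTF-8 byte ends up in the most
    significant position (an arbitrary choice for this sketch).
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")

    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return code_point, 1
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        b0 = 0xC0 | (code_point >> 6)
        b1 = 0x80 | (code_point & 0x3F)
        return (b0 << 8) | b1, 2
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        b0 = 0xE0 | (code_point >> 12)
        b1 = 0x80 | ((code_point >> 6) & 0x3F)
        b2 = 0x80 | (code_point & 0x3F)
        return (b0 << 16) | (b1 << 8) | b2, 3
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    b0 = 0xF0 | (code_point >> 18)
    b1 = 0x80 | ((code_point >> 12) & 0x3F)
    b2 = 0x80 | ((code_point >> 6) & 0x3F)
    b3 = 0x80 | (code_point & 0x3F)
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3, 4
```

With this packing, `utf8_packed(0x85)` gives `(0xC285, 2)`, whereas `utf8_packed(0x1C)` and `utf8_packed(0x1E)` stay single-byte, which matches the observation that only \x85 is affected.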
Bug description
\x85: 0xC2 0x85 (corrected from 120 56 53 or 0x78 0x38 0x35) is the NEXT LINE (NEL) character
\x1c: 0x1C (corrected from 92 120 49 99 or 0x5c 0x78 0x31 0x63) is the INFORMATION SEPARATOR FOUR character
\x1e: 0x1E (corrected from 92 120 49 101 or 0x5c 0x78 0x31 0x65) is the INFORMATION SEPARATOR TWO character
Steps to reproduce
System information