Closed V-FEXrt closed 3 months ago
I'm a bit concerned about byteToInteger
-- or maybe the problem's in the deserialization?
$ wake -x 'explode "[aɪ̯ pʰiː eɪ̯]" | map byteToInteger | map integerToByte'
"[", "a", "�", "�", " ", "p", "�", "i", "�", " ", "e", "�", "�", "]", Nil
$ wake -x 'explode "[aɪ̯ pʰiː eɪ̯]" | map unicodeToInteger | map integerToUnicode'
"[", "a", "ɪ", "̯", " ", "p", "ʰ", "i", "ː", " ", "e", "ɪ", "̯", "]", Nil
If that's a false negative, then it's all good, but I think we might need those single-quote raw strings that got simplified away long ago for explode
to give you the expected bytes.
You are right to be concerned, I was/am too. I tested this specific case using various emoji but I guess I didn't do a deep enough test.
There isn't really a good solution here. I have to send a stream of bytes to the server and unicodeToInteger
will very happily return values larger than a byte.
$ wake -x 'explode "[aɪ̯ pʰiː eɪ̯]" | map unicodeToInteger'
91, 97, 618, 815, 32, 112, 688, 105, 720, 32, 101, 618, 815, 93, Nil
^^^^^^^^ -- these don't fit in a byte
which will break the RSC. We might be able to use it if we could find the largest return value from unicodeToInteger
and update the RSC to use that type
Maybe if we can figure out the largest possible value that unicodeToInteger
will return but thats
Okay, I found the max size and what I need to change everything too. I'd like to do that is a separate PR after this one lands. This PR is a little weird to land then immediately change to something else but the PR brings the world into a more consistent state (that is equally broken as the previous state)
Ah actually the larger integer type doesn't work since we are just streaming bytes to files. We have to make this work using just a byte stream.
From an offline chat with Sam, turns out the issue is actually that explode
works on code points (which makes sense) so that is the core problem.
Wake is perfectly happy to reconstruct a UTF8 string from the cat
of a bunch of invalid UTF8 strings.
$ wake -x '((integerToByte 0xc3), (integerToByte 0xa7), Nil) | cat'
"ç"
Cleans up TODOS, updates the strToBytes function to use
byteToInteger
which shouldn't break serialization on non-ASCII strings