Closed: bremner closed this 3 years ago
I need to look and experiment a bit more with UTF-8 and the library. As originally designed, ASCII was all that was considered (and binary payloads). I did some additional experiments and it does appear to work fine with UTF-8, but I can't say for certain that there isn't an assumption baked in somewhere that would break under certain circumstances with UTF-8 characters outside the ASCII range.
Matthew Sottile writes:
> I need to look and experiment a bit more with UTF-8 and the library. As originally designed, ASCII was all that was considered (and binary payloads). I did some additional experiments and it does appear to work fine with UTF-8, but I can't say for certain that there isn't an assumption baked in somewhere that would break under certain circumstances with UTF-8 characters outside the ASCII range.
That makes sense. Thanks in advance for looking into it.
d
FYI, I think there may be a subtle issue here related to the interaction of UTF-8 and the val_used and val_allocated counters used for tracking memory usage. I believe those are incremented assuming single-byte characters, which means they may become incorrect when given input containing multi-byte characters. I'll need to instrument the parser to see whether those values stay correct for such input. There are also a couple of instances of strncpy that need to be checked to confirm they do the right thing in the presence of UTF-8.
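To make the concern concrete, here is a minimal sketch of the two things that can diverge with multi-byte input: the byte length of a string versus its character count, and a byte-limited copy that lands in the middle of a character. The helper names `utf8_codepoints` and `utf8_safe_prefix` are hypothetical and are not part of the library; the sketch assumes valid, NUL-terminated UTF-8 and says nothing about how val_used and val_allocated are actually maintained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 code points by skipping
 * continuation bytes, which always have the form 10xxxxxx. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: largest prefix length (in bytes) that is
 * <= max_bytes and does not split a multi-byte sequence.
 * Assumes s is valid UTF-8. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;
    size_t cut = max_bytes;
    /* If the byte at the cut point is a continuation byte, the cut
     * would fall inside a character, so back up to its lead byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For the Armenian word "բարև", `strlen` reports 8 bytes while `utf8_codepoints` reports 4 characters, and a 3-byte limit has to be reduced to 2 bytes to avoid cutting the second character in half. If the counters consistently track bytes, multi-byte input should be harmless; the risk is only where some code treats a byte count as a character count, or truncates at an arbitrary byte offset.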
I have updated the tests (and put them in a separate file). I took the Markus Kuhn utf8 demo file and converted it to s-expressions (well, I added some parens so it wasn't just a big pile of atoms). For me valgrind does not report any errors when running the various tests (ctorture, readtests, read_and_dump) with this file as input.
This is "hello world" in Armenian and Yiddish, and some silly pseudoequation using math characters.
I couldn't really deduce what kind of support for UTF-8 encoded non-ASCII text there is (or is intended to be). I decided to test a few examples. They seem to work, I guess on the basis that the delimiter characters (whitespace, parens, quotes, #) are ASCII. If that sounds right, maybe a note in the top-level README would be appropriate.
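The reasoning about ASCII delimiters can be illustrated with a toy byte-wise scanner. In UTF-8, every byte of a multi-byte sequence has its high bit set (0x80 or above), so no such byte can ever compare equal to an ASCII delimiter. The sketch below is not the library's parser: `is_delim` and `count_atoms` are hypothetical names, and for brevity only whitespace and parentheses are treated as delimiters (quotes and # are left out).

```c
#include <stddef.h>

/* Delimiters for this toy scanner: ASCII whitespace and parentheses.
 * All of them are below 0x80. */
static int is_delim(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
           c == '(' || c == ')';
}

/* Count atoms by scanning one byte at a time. Bytes >= 0x80 never
 * match a delimiter, so multi-byte characters are carried along as
 * ordinary atom content without any UTF-8 decoding. */
size_t count_atoms(const char *s)
{
    size_t atoms = 0;
    int in_atom = 0;
    for (; *s != '\0'; s++) {
        if (is_delim((unsigned char)*s)) {
            in_atom = 0;
        } else if (!in_atom) {
            in_atom = 1;
            atoms++;
        }
    }
    return atoms;
}
```

Given an input such as `(hello (բարև וועלט))`, this scanner finds three atoms even though it never looks at character boundaries, which matches the observation that the examples "seem to work".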