add some simple non-ascii tests

mjsottile / sfsexp

Small Fast S-Expression Library

add some simple non-ascii tests #6

Closed: bremner closed this 3 years ago

bremner commented 3 years ago

This is "hello world" in Armenian and Yiddish, plus a silly pseudo-equation using math characters.

I couldn't really deduce what kind of support there is (or is intended to be) for UTF-8 encoded non-ASCII text, so I decided to test a few examples. They seem to work, I guess because the delimiter characters (whitespace, parens, quotes, #) are all ASCII. If that sounds right, maybe a note in the top-level README would be appropriate.
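To illustrate the reasoning about ASCII delimiters: in UTF-8, every byte of a multi-byte character has its high bit set (0x80 to 0xFF), so none of those bytes can be mistaken for an ASCII delimiter by a parser that scans byte by byte. The sketch below is only an illustration of that property. The helper names `is_delimiter_byte` and `count_delimiter_bytes` are hypothetical and are not part of sfsexp, and the delimiter set is taken from the list in the comment above rather than from the library source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not sfsexp API): true if c is one of the ASCII
 * delimiters named in the thread (whitespace, parens, quotes, #). */
bool is_delimiter_byte(unsigned char c)
{
    /* strchr would also match the terminating NUL, so exclude it. */
    return c != '\0' && strchr(" \t\r\n()\"'#", c) != NULL;
}

/* Hypothetical helper: count delimiter bytes in a buffer, scanning one
 * byte at a time the way a byte-oriented parser would. */
size_t count_delimiter_bytes(const char *buf, size_t len)
{
    assert(buf != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_delimiter_byte((unsigned char)buf[i]))
            n++;
    }
    return n;
}
```

For example, the 7-byte buffer `"(\xD5\xA2 \xD5\xA1)"` holds two Armenian letters (U+0562 and U+0561) inside parentheses; only the two parentheses and the space are counted as delimiters, because the four bytes of the letters are all above 0x7F. This property is specific to UTF-8 and would not hold for encodings such as UTF-16.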

mjsottile commented 3 years ago

I need to look and experiment a bit more with UTF-8 and the library. As originally designed, ASCII was all that was considered (and binary payloads). I did some additional experiments and it does appear to work fine with UTF-8, but I can't say for certain that there isn't an assumption baked in somewhere that would break under certain circumstances with UTF-8 characters outside the ASCII range.

bremner commented 3 years ago

That makes sense. Thanks in advance for looking into it.

mjsottile commented 3 years ago

FYI, I think there may be a subtle issue here related to the interaction of UTF-8 with the val_used and val_allocated fields used for tracking memory usage. I believe those are incremented assuming single-byte characters, which means they may go bad on input containing multi-byte characters. I'll need to instrument the parser to check whether those values stay correct for such input. There are also a couple of instances of strncpy that need to be checked to confirm they do the right thing in the presence of UTF-8.
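The distinction behind this concern can be sketched in a few lines of C: a UTF-8 string has a byte length and a (possibly smaller) code point count, and a byte-limited copy such as strncpy can cut a multi-byte character in half. Whether this affects val_used and val_allocated depends on whether those fields are meant to count bytes or characters, which the thread does not settle. The helpers below, `utf8_codepoints` and `utf8_safe_prefix`, are hypothetical illustrations and not part of sfsexp; both assume the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated, valid
 * UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: largest prefix length, at most max_bytes, that
 * does not end in the middle of a multi-byte character. A copy limited
 * to this many bytes never splits a character. */
size_t utf8_safe_prefix(const char *s, size_t len, size_t max_bytes)
{
    assert(s != NULL);
    size_t n = len < max_bytes ? len : max_bytes;
    /* If the byte just past the cut is a continuation byte, the cut is
     * inside a character: back up to that character's lead byte. */
    while (n > 0 && n < len && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

As a usage example, `"a\xC3\xA9" "b"` ("a", "é", "b") is 4 bytes but 3 code points. A 2-byte limit would land between the two bytes of "é", so `utf8_safe_prefix` returns 1 for that limit, while a 3-byte limit falls on a character boundary and is returned unchanged.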

bremner commented 3 years ago

I have updated the tests (and put them in a separate file). I took the Markus Kuhn UTF-8 demo file and converted it to s-expressions (well, I added some parens so it wasn't just a big pile of atoms). For me, valgrind does not report any errors when running the various tests (ctorture, readtests, read_and_dump) with this file as input.