lassik opened 3 years ago
One way would be to have a POSE format that represents the AST, so that we could have a simpler multi-language test suite. Say, for:

```
(symbol "value")
```

you would get the output:

```
((symbol "symbol") (string "value"))
```

?
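A minimal Python sketch of what such an AST writer could look like. Everything here is illustrative: the `Symbol` class and `to_ast` name are hypothetical stand-ins, not part of any POSE library.

```python
# Hypothetical sketch: turn a parsed POSE datum into a tagged AST form,
# e.g. (symbol "value") -> ((symbol "symbol") (string "value")).
# Symbol and to_ast are made-up names for illustration only.

class Symbol:
    def __init__(self, name):
        self.name = name

def to_ast(datum):
    if isinstance(datum, Symbol):
        return ["symbol", datum.name]
    if isinstance(datum, str):
        return ["string", datum]
    if isinstance(datum, list):
        # A list maps to the list of its tagged elements.
        return [to_ast(x) for x in datum]
    raise TypeError(f"unsupported datum: {datum!r}")

print(to_ast([Symbol("symbol"), "value"]))
# -> [['symbol', 'symbol'], ['string', 'value']]
```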
That would necessitate adding extra code to the POSE writer in each library to write out the AST representation (or to construct a meta-level representation of a POSE expression that has been read in, i.e. a mapping from Exp to Exp).
The draw of S-expressions is that they can be their own AST; the mapping from S-exp to AST is 1:1. It'll be easier to write test data that covers all the data types we support. I expect most bugs to be in edge cases about what characters are allowed to be part of symbols, what counts as whitespace, etc; the big picture (which datums are contained in a file, and how they are nested in each other) is reasonably easy to get right.
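To make the edge-case point concrete, here are a few byte-level probe inputs of the kind such a suite could collect. Whether each one is valid POSE is deliberately left open; these are assumptions to be settled by the spec, not claims about it.

```python
# Inputs that probe symbol/whitespace edge cases where parsers tend to
# disagree. Validity under POSE is an open question, not asserted here.
probes = [
    b"(a\tb)",         # is a tab whitespace between symbols?
    b"(a\x0cb)",       # form feed?
    b"(a\rb)",         # bare carriage return without a newline?
    b"(a\xc2\xa0b)",   # U+00A0 no-break space (non-ASCII whitespace)
]
for p in probes:
    print(p)
```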
Unicode handling is another place where bugs easily lurk. BOM (byte order mark), UTF-8 vs UTF-16, normalization forms (NFC vs NFD), non-ASCII whitespace, etc. And some string types (e.g. in Go) are byte strings internally, whereas others (.NET and JVM) are UTF-16, and still others (Gambit Scheme) are UTF-32.
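A quick sketch of two of those Unicode pitfalls using only the Python standard library (nothing POSE-specific is assumed):

```python
import unicodedata

# NFC vs NFD: the same text as one composed code point vs. base + combining mark.
nfc = unicodedata.normalize("NFC", "e\u0301")   # "e" + combining acute
nfd = unicodedata.normalize("NFD", "\u00e9")    # precomposed "é"
print(nfc == "\u00e9")   # True: NFC composes
print(nfd == "e\u0301")  # True: NFD decomposes

# A UTF-8 BOM at the start of a file is easy to mistake for symbol characters.
data = b"\xef\xbb\xbf(symbol)"
text = data.decode("utf-8-sig")  # "utf-8-sig" strips the BOM
print(text)  # (symbol)
```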
The most esoteric test cases should probably be written as hex dumps, since the raw bytes would easily get mangled by text editors and other tools that try to clean up the file.
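For instance, a test fixture could store the input as a hex string and decode it at test time. The space-separated hex convention here is made up for illustration, not a POSE standard:

```python
import binascii

# "(a\tb)" with a literal tab, written as hex so whitespace survives any editor.
hex_input = "28 61 09 62 29"
raw = binascii.unhexlify(hex_input.replace(" ", ""))
print(raw)  # b'(a\tb)'
```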
We should probably collect a few POSE files for use as test inputs (and verify that writing them produces the same encoding in an agreed-upon normal form).
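The round-trip check could look roughly like this. `pose_read` and `pose_write` are hypothetical stand-ins for each library's reader and writer; the assumption is that each collected file is already in the agreed normal form, so rewriting it must be byte-identical.

```python
def check_round_trip(data: bytes, pose_read, pose_write) -> bool:
    """Return True if writing the parsed datum reproduces the bytes exactly.

    pose_read/pose_write are placeholders for a real POSE reader and writer.
    """
    datum = pose_read(data.decode("utf-8"))
    rewritten = pose_write(datum).encode("utf-8")
    return rewritten == data

# Demo with identity stand-ins; a real test would plug in the library under test.
print(check_round_trip(b"(a b)", lambda s: s, lambda s: s))  # True
```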