Open lassik opened 3 years ago
It would be nice to have a consistent spec that can deal with non-ASCII characters.
The format should definitely permit non-ASCII characters, but IMHO they should not be syntactically significant. Whitespace has syntactic meaning, and it's non-trivial to keep track of all Unicode whitespace codepoints. We don't want to force all POSE implementations to ship with big Unicode tables.
@johnwcowan is a Unicode expert; any advice?
I've had some pains in dealing with different encodings. I agree on "should not be syntactically significant".
The Go spec says:
White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored [...]
So you can write code like fmt.Println("Hello, 世界")
but the non-ASCII characters are inside string literals.
I think Go also allows non-ASCII identifiers. If we have vertical bar symbols in POSE, IMHO we should permit non-ASCII in them.
So POSE would permit non-ASCII in:
and nowhere else. Is this reasonable?
Comments and quoted strings, yes, provided the only encoding is UTF-8. For my view on symbols, see #3.
Agreed.
In the F# code (and possibly others) we're using the host language's native char functions, which I assume are Unicode-aware.
The grammar in the current draft has:
whitespace = HT | LF | VT | FF | CR | space
Expressed as ASCII codepoints, that is 0x09..0x0D (HT..CR) and 0x20 (space). To get consistent parsing across languages, we should detect these bytes explicitly. Here's an example of ASCII-only character detection from the SML code: