Parsing should be less Unicode-aware

s-expressions / pose

Portable S-expressions (POSE) spec and libs

Parsing should be less Unicode-aware #8

Open lassik opened 3 years ago

lassik commented 3 years ago

In the F# code (and possibly others) we're using the host language's native char functions, which I assume are Unicode-aware.

The grammar in the current draft has whitespace = HT | LF | VT | FF | CR | space. Expressed as ASCII codepoints, that is 0x09..0x0D (HT..CR) and 0x20 (space). To get consistent parsing across languages, we should test for these bytes explicitly instead of calling the host language's character classification functions.

Here's an example of ASCII-only character detection from the SML code:

fun charIsWhitespace char =
    let val cc = Char.ord char in
        (cc = 0x20) orelse (cc >= 0x09 andalso cc <= 0x0D)
    end;

fun charIsAlphabetic char =
    ((char >= #"A") andalso (char <= #"Z")) orelse
    ((char >= #"a") andalso (char <= #"z"));

fun charIsNumeric char =
    ((char >= #"0") andalso (char <= #"9"));
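
For comparison, here is a rough Python sketch of the same three predicates. The function names are made up for illustration and are not taken from any POSE library. Python's built-in str.isspace(), str.isalpha() and str.isdigit() are Unicode-aware, so they show the kind of mismatch this issue is about.

```python
def is_pose_whitespace(ch: str) -> bool:
    """True only for HT, LF, VT, FF, CR and space (hypothetical helper name)."""
    cc = ord(ch)  # ord() raises TypeError unless ch is exactly one character
    return cc == 0x20 or 0x09 <= cc <= 0x0D


def is_ascii_alphabetic(ch: str) -> bool:
    """True only for A-Z and a-z."""
    cc = ord(ch)
    return 0x41 <= cc <= 0x5A or 0x61 <= cc <= 0x7A


def is_ascii_numeric(ch: str) -> bool:
    """True only for 0-9."""
    return 0x30 <= ord(ch) <= 0x39


# The built-ins accept characters that the ASCII-only versions reject.
print("\u00e9".isalpha(), is_ascii_alphabetic("\u00e9"))  # prints: True False
print("\u0663".isdigit(), is_ascii_numeric("\u0663"))     # prints: True False
```

U+00E9 is "é" and U+0663 is the Arabic-Indic digit three. A parser built on the built-ins would treat both as syntactically meaningful; one built on the explicit ranges would not.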

wallymathieu commented 3 years ago

It would be nice to have a consistent spec that can deal with non-ASCII characters.

lassik commented 3 years ago

The format should definitely permit non-ASCII characters, but IMHO they should not be syntactically significant. Whitespace has syntactic meaning, and it's non-trivial to keep track of all Unicode whitespace codepoints. We don't want to force all POSE implementations to ship with big Unicode tables.
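
To give a feel for the gap, the sketch below lists every codepoint that Python's str.isspace() accepts beyond the six in the draft grammar. The exact list depends on the Unicode database bundled with the interpreter, so no output is shown here, and other languages may classify characters differently. GRAMMAR_WHITESPACE is a name chosen for this example.

```python
import sys

# The six whitespace codepoints from the draft grammar: HT, LF, VT, FF, CR, space.
GRAMMAR_WHITESPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}

# Every codepoint the host language considers whitespace.
unicode_aware = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Codepoints that a Unicode-aware parser would treat as whitespace
# even though the grammar does not list them.
extra = sorted(unicode_aware - GRAMMAR_WHITESPACE)
print(len(extra), [f"U+{cp:04X}" for cp in extra])
```

On CPython the extras include U+00A0 (no-break space) and U+3000 (ideographic space), which are easy to paste into a file without noticing.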

lassik commented 3 years ago

@johnwcowan is a Unicode expert; any advice?

wallymathieu commented 3 years ago

I've had some pain dealing with different encodings. I agree with "should not be syntactically significant".

lassik commented 3 years ago

The Go spec says:

White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored [...]

So you can write code like fmt.Println("Hello, 世界") but the non-ASCII characters are inside string literals.

I think Go also allows non-ASCII identifiers. If we have vertical bar symbols in POSE, IMHO we should permit non-ASCII in them.

lassik commented 3 years ago

So POSE would permit non-ASCII in:

comments
double-quoted strings
vertical-bar symbols

and nowhere else. Is this reasonable?
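
A rough sketch of how an implementation might enforce that rule is below. It makes several assumptions that this thread does not settle: comments start with ";" and run to the end of the line, strings and vertical-bar symbols use backslash as an escape character, and there are no block comments. The function name stray_non_ascii is made up for this example.

```python
def stray_non_ascii(text: str) -> list[int]:
    """Return offsets of non-ASCII characters found outside comments,
    double-quoted strings and vertical-bar symbols."""
    offsets = []
    state = "code"  # one of: "code", "comment", "string", "bar"
    i = 0
    while i < len(text):
        ch = text[i]
        if state == "code":
            if ch == ";":        # assumed line-comment syntax
                state = "comment"
            elif ch == '"':
                state = "string"
            elif ch == "|":
                state = "bar"
            elif ord(ch) > 0x7F:
                offsets.append(i)
        elif state == "comment":
            if ch == "\n":
                state = "code"
        else:  # inside a string or a vertical-bar symbol
            closer = '"' if state == "string" else "|"
            if ch == "\\":
                i += 1           # skip the escaped character (assumed escape syntax)
            elif ch == closer:
                state = "code"
        i += 1
    return offsets


print(stray_non_ascii('(greet "世界") ; 世界\n(|na\u00efve| x)'))  # prints: []
print(stray_non_ascii("(na\u00efve 1)"))                          # prints: [3]
```

The first input keeps all non-ASCII text inside a string, a comment and a vertical-bar symbol, so nothing is reported. The second uses "ï" in a bare symbol, which the rule would reject.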

johnwcowan commented 3 years ago

Comments and quoted strings, yes, provided the only encoding is UTF-8. For my view on symbols, see #3.
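
One reason a UTF-8-only rule fits well with ASCII-only syntax: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so non-ASCII text can never be mistaken for an ASCII delimiter or whitespace byte. A minimal sketch, assuming a reader that rejects anything that is not valid UTF-8 (decode_pose_source is a hypothetical name):

```python
def decode_pose_source(data: bytes) -> str:
    """Decode input as UTF-8, raising UnicodeDecodeError on any invalid byte."""
    return data.decode("utf-8", errors="strict")


# All bytes that encode non-ASCII characters have the high bit set,
# so a byte-level scan for '(' , '"' or space cannot hit them by accident.
print(all(b >= 0x80 for b in "世界".encode("utf-8")))  # prints: True

print(decode_pose_source(b'("caf\xc3\xa9")'))  # prints: ("café")

try:
    decode_pose_source(b'("caf\xe9")')  # the same text encoded as Latin-1
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```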

lassik commented 3 years ago

Agreed.