S-expr parser

ilammy commented 7 years ago

Implement an s-expr parser that can parse a stream of tokens and construct s-exprs from it, producing a tree of s-exprs with node locations, along with diagnostics.

Tracking productions specified by R7RS:

- [x] datum
  - [x] simple datum
    - [x] boolean
    - [x] number
    - [x] character
    - [x] string
    - [x] symbol
    - [x] bytevector
  - [x] compound datum
    - [x] list
      - [x] `( datum* )`
      - [x] `( datum+ . datum )`
    - [x] vector
    - [x] abbreviation
- [x] label handling
- [x] comment handling
  - [x] handling `#;` datum comments
  - [x] ~~preserving comments (and whitespace?) in parse trees~~
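
The two list productions above could be parsed roughly as follows. This is only a sketch in Python, not the project's code: the token kinds (`"open"`, `"close"`, `"dot"`, `"atom"`), the `Node` type, and the diagnostic messages are all hypothetical, and node locations are simplified to ranges of token indices.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    kind: str               # "atom" or "list"
    text: str               # atom text; "" for lists
    items: List["Node"]     # list elements; empty for atoms
    tail: Optional["Node"]  # datum after the dot, if any
    span: Tuple[int, int]   # half-open range of token indices


def parse(tokens):
    """Parse (kind, text) tokens into (nodes, diagnostics)."""
    nodes, diagnostics = [], []
    pos = 0

    def parse_datum():
        nonlocal pos
        start = pos
        kind, text = tokens[pos]
        pos += 1
        if kind == "atom":
            return Node("atom", text, [], None, (start, pos))
        if kind != "open":
            diagnostics.append((start, f"unexpected {text!r}"))
            return None
        items, tail = [], None
        while pos < len(tokens) and tokens[pos][0] != "close":
            if tokens[pos][0] == "dot":
                dot_at = pos
                pos += 1
                if not items:
                    diagnostics.append((dot_at, "no datum before the dot"))
                if pos < len(tokens) and tokens[pos][0] not in ("close", "dot"):
                    tail = parse_datum()
                else:
                    diagnostics.append((dot_at, "no datum after the dot"))
                if pos < len(tokens) and tokens[pos][0] != "close":
                    diagnostics.append((pos, "expected ')' after the tail"))
                    # Recover by skipping everything up to the closing paren.
                    while pos < len(tokens) and tokens[pos][0] != "close":
                        parse_datum()
                break
            item = parse_datum()
            if item is not None:
                items.append(item)
        if pos < len(tokens):
            pos += 1  # consume the closing paren
        else:
            diagnostics.append((start, "unterminated list"))
        return Node("list", "", items, tail, (start, pos))

    while pos < len(tokens):
        node = parse_datum()
        if node is not None:
            nodes.append(node)
    return nodes, diagnostics
```

The parser keeps going after an error and records a diagnostic instead of raising, so a single pass can report several problems while still producing a tree.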

High-level tasks:

- [x] Implement the datum parser.

  It will only parse tokens into a tree-like structure; it will not expand labels and abbreviations.
- [x] Convert the datum tree to sexprs.

  This step will expand labels, check them for correctness, and also expand abbreviations.

ilammy commented 7 years ago

I believe we should not keep comments in the parse tree. I had intended this for documentation comments, but then I realized that there is already a traditional way to do this: docstrings. Preserving this information is therefore only necessary for recreating the source code from the AST, which is not really useful in itself.

ilammy commented 7 years ago

After some time and several coding attempts, I have finally realized that it is not practical to resolve and check labels directly at the reader layer. The reason is that the reader produces data, but programs are not composed of data: they are written with expressions, and only some parts of expressions are treated as data. The datum label specification is strict about allowed usage and scoping rules, which we cannot validate in programs until we know which parts are data and which are expressions. Unfortunately, we can confirm this only after performing macroexpansion. Examples follow.

The specification allows datum labels only in literals. However, literal expressions are not limited to `(quote #1=(foo))` and `'#2=(foo)`. For example, `(some-name (bar #3=(#3#)))` may or may not be a valid expression, depending on the meaning of the imported symbols `some-name` and `bar`. If `some-name` and `bar` are both procedures, the expression is invalid. If either `some-name` or `bar` is a macro, the expression may be valid if it expands into something where `#3=(#3#)` ends up inside a quotation. We cannot tell that in the parser.
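
To make the dependence on bindings concrete, here is a small Python sketch of the three possible answers. Everything in it is an assumption made for illustration: data are modelled as nested Python lists of strings, labels as tuples `("label-def", n, datum)` and `("label-ref", n)`, and `env` is a hypothetical table mapping names to `"procedure"`, `"macro"`, or `"quote"`.

```python
def contains_label(datum):
    """True if a label definition or reference occurs anywhere in datum."""
    if isinstance(datum, tuple):
        return True
    if isinstance(datum, list):
        return any(contains_label(item) for item in datum)
    return False


def classify(expr, env):
    """Return "ok", "invalid", or "needs-expansion" for an expression."""
    if isinstance(expr, tuple):
        return "invalid"  # a label in expression position
    if not isinstance(expr, list) or not expr:
        return "ok"       # a plain symbol, or an empty form
    head = expr[0]
    kind = env.get(head) if isinstance(head, str) else None
    if kind == "quote":
        return "ok"       # everything inside is data
    if kind == "macro":
        # Only the expander knows where the labelled datum ends up.
        return "needs-expansion" if contains_label(expr) else "ok"
    # Otherwise treat the form as a call: every element is an expression.
    results = [classify(item, env) for item in expr]
    if "invalid" in results:
        return "invalid"
    if "needs-expansion" in results:
        return "needs-expansion"
    return "ok"


# (some-name (bar #3=(#3#)))
example = ["some-name", ["bar", ("label-def", 3, [("label-ref", 3)])]]
```

With both names bound to procedures, `classify(example, env)` yields `"invalid"`. As soon as either one is a macro, the answer becomes `"needs-expansion"`, which is exactly the information a parser does not have.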

The specification also restricts the scope of a datum label to the outermost datum in order to allow shared datum literals. However, for the same reason (macros), we cannot tell which parts of a program are expressions and which are data, or what the extent of the outermost datum is. For example, `(foo (#1=(bar) #1#))` should definitely be considered valid if `foo` is a synonym for `quote`, but may be treated as invalid if `foo` splices `#1=(bar)` and `#1#` into different literals during expansion. It may even be invalid if they are spliced into the same literal, depending on the implementation. (It is also possible to expand the expression into `(quote (#1# #1=(bar)))`, which again may or may not be valid.) The spec does not seem to cover this aspect.
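
The scoping problem can be illustrated with a checker that runs on one outermost datum at a time, once the literal boundaries are known. This is a hedged Python sketch, not the project's implementation: labels are modelled as hypothetical tuples `("label-def", n, datum)` and `("label-ref", n)` inside nested lists, and rejecting a reference that precedes its definition is an assumption about how strict such a checker would be.

```python
def check_labels(datum):
    """Return a list of problems found in one outermost datum."""
    defined, problems = set(), []

    def walk(d):
        if isinstance(d, list):
            for item in d:
                walk(item)
        elif isinstance(d, tuple) and d[0] == "label-def":
            _, n, inner = d
            if n in defined:
                problems.append(f"label #{n}= defined twice")
            defined.add(n)  # visible inside its own datum: #3=(#3#)
            walk(inner)
        elif isinstance(d, tuple) and d[0] == "label-ref":
            if d[1] not in defined:
                problems.append(f"#{d[1]}# has no earlier definition in this datum")

    walk(datum)
    return problems


# (#1=(bar) #1#) kept together as a single literal
together = [("label-def", 1, ["bar"]), ("label-ref", 1)]
# the same two pieces after a macro moved them into different literals
first, second = ("label-def", 1, ["bar"]), ("label-ref", 1)
# the reordered form (#1# #1=(bar))
reordered = [("label-ref", 1), ("label-def", 1, ["bar"])]
```

Here `check_labels(together)` finds nothing, while `check_labels(second)` and `check_labels(reordered)` each report a dangling reference. The same source text gives different results depending on what expansion does with it.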

To sum up: labels should be handled later. The reader should expand abbreviations, but it should not resolve or validate labels.
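
A minimal Python sketch of that division of labour, under assumed representations: abbreviations appear in the datum tree as hypothetical tuples `("abbrev", prefix, datum)`, labels as `("label-def", n, datum)` and `("label-ref", n)`, and everything else as nested lists of strings. The prefix-to-symbol table follows the standard Scheme abbreviations.

```python
ABBREVIATIONS = {
    "'": "quote",
    "`": "quasiquote",
    ",": "unquote",
    ",@": "unquote-splicing",
}


def expand(datum):
    """Expand abbreviations; pass labels through unresolved."""
    if isinstance(datum, list):
        return [expand(item) for item in datum]
    if isinstance(datum, tuple):
        if datum[0] == "abbrev":
            _, prefix, inner = datum
            return [ABBREVIATIONS[prefix], expand(inner)]
        if datum[0] == "label-def":
            _, n, inner = datum
            # Keep the label marker; only expand what it wraps.
            return ("label-def", n, expand(inner))
        return datum  # a label reference stays as it is
    return datum      # an atom
```

For example, `'#2=(foo)` modelled as `("abbrev", "'", ("label-def", 2, ["foo"]))` expands to `["quote", ("label-def", 2, ["foo"])]`: the quotation is made explicit, and the label is left for a later pass.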

ilammy / sabre

S-expr parser #6