Position information is broken for ``%empty``

horazont commented 8 years ago

For the following parser definition (syntax):

%lexer

\n+ %restart
A   A
B   B
X   X

%parser

document:
    foo:
        $$.sem = [$1.sem]
    document X foo:
        $$.sem = $1.sem
        $$.sem.append($3.sem)

foo:
    %empty:
        $$.sem = $$.pos
    A:
        $$.sem = $$.pos
    B:
        $$.sem = $$.sem

Together with a script bar.py:

#!/usr/bin/env python3
import pprint
import sys

from foo import Parser, Lexer

l = Lexer(open(sys.argv[1], "rb"), filename=sys.argv[1])
p = Parser(l)

result = p.Parse()

for item in result:
    print(item)

And the following input (text):

AX

We get the following:

$ python3 -m pyLRp -lL3Td -o foo.py syntax && python3 bar.py text
0 # (0, 4) A "A"
0 4 # (1, 3) X "X"
0 6 # (1, 0) X "X"
0 1 # (0, 2) X "X"
0 1 2 # (1, 2) $EOF ""
0 1 2 3 # (1, 1) $EOF ""
0 1 # (1, 5) $EOF ""
text Line 1:0 - 1:1
 Line 0:0 - 1:2

As you can see, even the file name is missing from the position information for the %empty production. I understand that it might be difficult to get a coherent range of characters, but the file name should be available correctly.

If possible, col0 == col1 and line0 == line1 would be nice, too, but I don’t know if it makes sense.

sebastianriese commented 8 years ago

This is due to the fact, that position information for non-terminals is generated from the position information of the corresponding terminal symbols (which correspond to input lexemes). No tokens, no useful position information.

Possible solutions (which will be implemented is open to discussion):

One possibility is to derive the position for %empty reductions from the position of the lookahead token (denoted pos here) as Position(pos.file, pos.line0, pos.col0, pos.line0, pos.col0), i.e. the empty span before the next token. This will probably give good results in most cases, but might result in confusing position spans if the %empty reduction ends a list – whose position span will then be extended to include trailing ignored characters.
Rather than giving some arbitrary empty position span (if there are ignored characters, e.g. whitespace, in the input around the tokens, then the correct position of an %empty non-terminal is not clear), the automatic position of non-terminals deriving from empty should be set to NoPosition, which is a neutral element with respect to position span concatenation. Of course code external code would have to handle NoPosition correctly (thereby breaking the API).

In any case, the position handling of empty productions must be special cased to behave correctly.

As a last remark, the behaviour of the position tracking is worse if the %empty reduction is not the first reduction the behaviour will be even worse. The offending code is a reference to stack[-size].pos (and the size of an %empty reduction is zero), so they will span all the input so far.

horazont commented 8 years ago

The offending code is a reference to stack[-size].pos (and the size of an %empty reduction is zero), so they will span all the input so far.

That at least explains the results I’m seeing for {line,col}{0,1}.

I can think of more options for the %empty position, but I don’t know how hard to implement those would be:

Use the position of the previous token (but that’s ambiguous).
Use the span between the position of the previous token and the position of the lookahead token. This may be an empty span, which would then be line0 == line1 == lookahead_pos.line0 and col0 == col1 == lookahead_pos.col0.

Aside from my last suggestion, NoPosition makes most sense to me. I doubt that much code will be using that behaviour.

In the end, I should probably simply re-write the position of the AST element coming out of the %empty in the code for what is document in the example.

sebastianriese commented 8 years ago

Rewriting the position later on is not a completely safe workaround (although it works in this case), because the position of the %empty reduction is used to calculate the position span of the surrounding nodes.

As to the other possible solutions: The previous token is not readily available to the parser when it encounters an empty reduction (but we can extract the end position from the top stack element, which may already be a non-terminal, but whose span will end at the end of the previous token) also there might not be a previous token (this of course corresponds to the position 0:0-0:0, but the parser does not know the current file-name). The span between the previous and next token will most definitely result in confusing position spans for further derived objects.

The problem with NoPosition occurs, if the position information of the reduction is accessed explicitly from code, e.g. when a compiler maps the position information generated by the parser to its own position information format while transforming the syntax tree to some other intermediate representation or if it generates error messages with position information in later steps and the error message printer encounters a NoPosition but cannot handle it.

sebastianriese / pyLR1

Position information is broken for ``%empty`` #21