sebastianriese / pyLR1

A pure python LR(1)/LALR(1) parser generator
7 stars 3 forks source link

Position information is broken for ``%empty`` #21

Open horazont opened 8 years ago

horazont commented 8 years ago

For the following parser definition (syntax):

%lexer

\n+ %restart
A   A
B   B
X   X

%parser

document:
    foo:
        $$.sem = [$1.sem]
    document X foo:
        $$.sem = $1.sem
        $$.sem.append($3.sem)

foo:
    %empty:
        $$.sem = $$.pos
    A:
        $$.sem = $$.pos
    B:
        $$.sem = $$.sem

Together with a script bar.py:

#!/usr/bin/env python3
import pprint
import sys

from foo import Parser, Lexer

l = Lexer(open(sys.argv[1], "rb"), filename=sys.argv[1])
p = Parser(l)

result = p.Parse()

for item in result:
    print(item)

And the following input (text):

AX

We get the following:

$ python3 -m pyLRp -lL3Td -o foo.py syntax && python3 bar.py text
0 # (0, 4) A "A"
0 4 # (1, 3) X "X"
0 6 # (1, 0) X "X"
0 1 # (0, 2) X "X"
0 1 2 # (1, 2) $EOF ""
0 1 2 3 # (1, 1) $EOF ""
0 1 # (1, 5) $EOF ""
text Line 1:0 - 1:1
 Line 0:0 - 1:2

As you can see, even the file name is missing from the position information for the %empty production. I understand that it might be difficult to get a coherent range of characters, but the file name should be available correctly.

If possible, col0 == col1 and line0 == line1 would be nice, too, but I don’t know if it makes sense.

sebastianriese commented 8 years ago

This is due to the fact, that position information for non-terminals is generated from the position information of the corresponding terminal symbols (which correspond to input lexemes). No tokens, no useful position information.

Possible solutions (which will be implemented is open to discussion):

In any case, the position handling of empty productions must be special cased to behave correctly.


As a last remark, the behaviour of the position tracking is worse if the %empty reduction is not the first reduction the behaviour will be even worse. The offending code is a reference to stack[-size].pos (and the size of an %empty reduction is zero), so they will span all the input so far.

horazont commented 8 years ago

The offending code is a reference to stack[-size].pos (and the size of an %empty reduction is zero), so they will span all the input so far.

That at least explains the results I’m seeing for {line,col}{0,1}.

I can think of more options for the %empty position, but I don’t know how hard to implement those would be:

Aside from my last suggestion, NoPosition makes most sense to me. I doubt that much code will be using that behaviour.

In the end, I should probably simply re-write the position of the AST element coming out of the %empty in the code for what is document in the example.

sebastianriese commented 8 years ago

Rewriting the position later on is not a completely safe workaround (although it works in this case), because the position of the %empty reduction is used to calculate the position span of the surrounding nodes.

As to the other possible solutions: The previous token is not readily available to the parser when it encounters an empty reduction (but we can extract the end position from the top stack element, which may already be a non-terminal, but whose span will end at the end of the previous token) also there might not be a previous token (this of course corresponds to the position 0:0-0:0, but the parser does not know the current file-name). The span between the previous and next token will most definitely result in confusing position spans for further derived objects.

The problem with NoPosition occurs, if the position information of the reduction is accessed explicitly from code, e.g. when a compiler maps the position information generated by the parser to its own position information format while transforming the syntax tree to some other intermediate representation or if it generates error messages with position information in later steps and the error message printer encounters a NoPosition but cannot handle it.