Open horazont opened 8 years ago
This happens because position information for non-terminals is generated from the position information of the corresponding terminal symbols (which correspond to input lexemes). No tokens, no useful position information.
Possible solutions (which one will be implemented is open to discussion):
Take the position of the next token (pos here) as Position(pos.file, pos.line0, pos.col0, pos.line0, pos.col0), i.e. the empty span before the next token. This will probably give good results in most cases, but might result in confusing position spans if the %empty reduction ends a list, whose position span will then be extended to include trailing ignored characters.
In any case, the position handling of empty productions must be special-cased to behave correctly.
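A minimal sketch of what such special-casing could look like. The Position tuple below is a stand-in whose field order is only inferred from the expression above, and reduction_position is a hypothetical helper, not the parser generator's actual code:

```python
from collections import namedtuple

# Stand-in for the real position class; field order assumed from
# Position(pos.file, pos.line0, pos.col0, pos.line0, pos.col0).
Position = namedtuple("Position", "file line0 col0 line1 col1")

def reduction_position(children, next_pos):
    """Compute the span of a reduction from its children's spans.

    children: positions of the reduced symbols (empty for %empty).
    next_pos: position of the next (lookahead) token.
    """
    if not children:
        # Special case: an empty span located just before the next token.
        return Position(next_pos.file, next_pos.line0, next_pos.col0,
                        next_pos.line0, next_pos.col0)
    first, last = children[0], children[-1]
    # Regular case: from the start of the first child to the end of the last.
    return Position(first.file, first.line0, first.col0,
                    last.line1, last.col1)

next_tok = Position("text", 2, 0, 2, 3)
empty_span = reduction_position([], next_tok)
```

With this rule the empty reduction keeps the file name and satisfies line0 == line1 and col0 == col1.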
As a last remark, the behaviour of the position tracking is even worse if the %empty reduction is not the first reduction. The offending code is a reference to stack[-size].pos (and the size of an %empty reduction is zero), so such reductions will span all the input so far.
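The indexing pitfall can be reproduced in plain Python, because -0 == 0 and therefore stack[-0] is the bottom of the stack rather than "nothing". The list below is a toy stand-in for the parse stack, and first_of_reduction is a hypothetical helper that only mirrors the stack[-size] expression:

```python
# Toy parse stack; real entries would carry position objects.
stack = ["tok0", "tok1", "tok2"]

def first_of_reduction(stack, size):
    # Mirrors the stack[-size] lookup: the first symbol of the reduction.
    return stack[-size]

two_symbols = first_of_reduction(stack, 2)  # second-to-last entry, as intended
zero_symbols = first_of_reduction(stack, 0)  # -0 == 0: the very first entry
```

For a size of zero the lookup lands on the first stack element, which is why the computed span starts at the beginning of the input.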
> The offending code is a reference to stack[-size].pos (and the size of an %empty reduction is zero), so they will span all the input so far.
That at least explains the results I’m seeing for {line,col}{0,1}.
I can think of more options for the %empty position, but I don’t know how hard those would be to implement:
Aside from my last suggestion, NoPosition makes most sense to me. I doubt that much code will be using that behaviour.
In the end, I should probably simply rewrite the position of the AST element coming out of the %empty in the code for what is document in the example.
Rewriting the position later on is not a completely safe workaround (although it works in this case), because the position of the %empty reduction is used to calculate the position span of the surrounding nodes.
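To illustrate why the late rewrite is not fully safe, here is a small sketch using plain character offsets instead of real position objects. Node and make_parent are hypothetical names, and the bogus span of the empty node is simply assumed:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    start: int  # offset of the first character of the span
    end: int    # offset just past the last character of the span

def make_parent(name, children):
    # The parent's span is derived from its children at reduction time.
    return Node(name, children[0].start, children[-1].end)

empty = Node("empty", 0, 12)   # assumed bogus span covering all input so far
item = Node("item", 12, 15)
parent = make_parent("list", [empty, item])

# Workaround applied afterwards: collapse the empty node to a zero-width span.
empty.start = empty.end = 12
```

The empty node is fixed, but the parent was computed earlier and still starts at offset 0.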
As to the other possible solutions: the previous token is not readily available to the parser when it encounters an empty reduction. We can extract the end position from the top stack element, which may already be a non-terminal, but whose span will end at the end of the previous token. However, there might not be a previous token at all; this of course corresponds to the position 0:0-0:0, but the parser does not know the current file name. The span between the previous and next token will most definitely result in confusing position spans for further derived objects.
The problem with NoPosition occurs if the position information of the reduction is accessed explicitly from code, e.g. when a compiler maps the position information generated by the parser to its own position format while transforming the syntax tree to some other intermediate representation, or when it generates error messages with position information in later steps and the error message printer encounters a NoPosition but cannot handle it.
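A sketch of the kind of guard downstream code would need. Both classes here are stand-ins defined for illustration (the real NoPosition and Position may look different), and format_error is a hypothetical error printer:

```python
from collections import namedtuple

# Stand-ins for illustration only; not the library's actual classes.
Position = namedtuple("Position", "file line0 col0 line1 col1")

class NoPosition:
    """Marker for nodes that have no usable source position."""

def format_error(message, pos):
    # Without this check, accessing pos.file on a NoPosition would raise
    # AttributeError in the error printer.
    if isinstance(pos, NoPosition):
        return f"<unknown position>: {message}"
    return f"{pos.file}:{pos.line0}:{pos.col0}: {message}"

with_pos = format_error("unexpected token", Position("text", 3, 7, 3, 9))
without_pos = format_error("empty document", NoPosition())
```

Every consumer of positions needs such a branch, which is the cost of this option.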
For the following parser definition (syntax):
Together with a script bar.py:
And the following input (text):
We get the following:
As you can see, even the file name is missing from the position information for the %empty production. I understand that it might be difficult to get a coherent range of characters, but the file name should be available correctly.
If possible, col0 == col1 and line0 == line1 would be nice, too, but I don’t know if it makes sense.