inspirer / textmapper

Lexer and Parser generator
http://textmapper.org
MIT License
108 stars 25 forks source link

bug: Incorrect parent and firstChild set for leftmost and rightmost nonterminals of zero length #23

Open mewmew opened 6 years ago

mewmew commented 6 years ago

Creating a dedicated issue to track the bug identified at https://github.com/inspirer/textmapper/pull/17#issuecomment-433245585

Edit: For added context, the only hand-written files are parser.go and tree.go, both of which are mostly identical copies of the original parser.go and tree.go of the TextMapper project. Which is why this bug is interesting to fix as it pertains to TextMapper itself.

I'll add my latest debug session below, feel free to skim as it's quite long :)

The debug session below is debugging the none command, which uses the mini.tm grammar, and is invoked on the example.ll input file.

Specifically the following commands may be used to reproduce the debug session:

$ go get -u github.com/mewspring/foo/none/cmd/none
$ go get -u github.com/derekparker/delve/cmd/dlv
$ go get -u github.com/aarzilli/gdlv
$ cd $GOPATH/src/github.com/mewspring/foo/none/cmd/none
$ $GOPATH/bin/gdlv debug ../../example.ll

Cheers, Robin

Debug session

Is it correct that next should be overwritten here? Then we will skip /* empty */ nonterminals that are in between other nonterminals.

2018-10-27-010734_1920x1175_scrot

Should it really be o >= endoffset? And not o > endoffset? Now, we decrease the end, so that InstMetadata will not be part of the call instruction.

2018-10-27-011328_1920x1175_scrot

OperandBundles and InstMetadata were not assigned to have CallInst as parent. They should have been. Note, end was too small. Most likely as a cause of using if o >= offset { end-- } instead of if o > offset { end-- }

2018-10-27-011841_1920x1175_scrot

This is probably wrong, as now OperandBundles and InstMetadata won't have CallInst as parent.

2018-10-27-013545_1920x1175_scrot

This is where it goes wrong. Really wrong. As the OperandBundles and InstMetadata of the previous CallInst are added to the current CallInst.

The only reason this happens is since OperandBundles and InstMetadata are zero in length, thus offset if not enough to distinguish which parent nonterminal they belong to.

2018-10-27-013740_1920x1175_scrot

This is the incorrect firstChild. It points to InstMetadata, but should really point to

2018-10-27-013942_1920x1175_scrot

Incorrect parents have now been set for InstMetadata and OperandBundles.

2018-10-27-014217_1920x1175_scrot

No parent was set for OperandBundles and InstMetadata. They should have both been assigned the last CallInst as parent (i.e. index 18).

2018-10-27-014315_1920x1175_scrot

Since InstMetadata (at index 10) is now the firstChild of the CallInst at index 18, and since the InstMetadata does not have a next (i.e. next is 0), the CallInst will not have any other children than the incorrectly added InstMetadata child. Therefore, calling Typ on CallInst will result in NONE type.

2018-10-27-014532_1920x1175_scrot

InstMetadata (at index 17) of previous instruction being added to have RetTerm as parent. The correct InstMetadata to add to RetTerm is at index 20.

2018-10-27-014915_1920x1175_scrot

FuncMetadata incorrectly added to have FuncBody as parent.

2018-10-27-015614_1920x1175_scrot

InstMetadata incorrectly added to have FuncBody as parent (should have been part of TermRet).

2018-10-27-015730_1920x1175_scrot

Panic with unknown node type NONE, since CallInst has InstMetadata as first child.

2018-10-27-020137_1920x1175_scrot

inspirer commented 6 years ago

Yeah, this is working as expected with the given data model. The parser produces a bunch of ranges, and the AST builder looks at how they are nested to produce the correct structure. This breaks profoundly with nullable nonterminals. First of all, the parser cannot properly position a nullable nonterminal in source code, and just puts it right before the next token, consider:

  input : A d ; 
  A : b copt ;

and b d as an input. There are two possibilities for positioning copt: right after b and right before d. Textmapper always picks the latter, which leads to including trailing spaces in both copt and A. The proper positioning depends on the outer rules, and we don't know which rule is going to be applied.

There are two solutions to this problem:

The rewritten grammar should look like this:

input : A d ;
A : b c? ;

For now, I'll use this issue to work on flagging occurrences of nullable nonterminals in unwanted locations in event based-parsers.

mewmew commented 6 years ago

To avoid (forbid?) nullable nonterminals at first and last positions in a rule. This is the rule I'm currently following everywhere. Textmapper should flag when a nullable terminal is used in such a way (a TODO for me).

I see your point and I was thinking of this too. This is definitely the simplest path to take for TextMapper. It may however put additional constraints on TM grammar writers. Specifically for the grammar of LLVM IR, optional nonterminals as used everywhere. So for this grammar, they would have to be rewritten to use Fooopt or Foo? or Bar* or Bar+? instead of nullable nonterminals.

GlobalDecl uses this approach:

GlobalDecl -> GlobalDecl
   : Name=GlobalIdent '=' ... GlobalAttrs=(',' GlobalAttr)+? FuncAttrs=(',' FuncAttr)+?
;

It would break for the current implementation of FuncDef I believe as FuncMetadata could belong to either FuncHeader, FuncBody or FuncDecl. With the current implementation it would belong to FuncBody right? It should belong to FuncDef.

FuncDef -> FuncDef
    : 'define' Header=FuncHeader Metadata=FuncMetadata Body=FuncBody
;

The solution being to inline the definition of FuncMetadata wherever used.

FuncMetadata -> FuncMetadata
    : MetadataAttachments=MetadataAttachment*
;

Just to give a sense for how much LLVM IR uses optional nonterminals, below follows the definition of FuncHeader:

FuncHeader -> FuncHeader
    : (Linkage | ExternLinkage)? Preemptionopt Visibilityopt DLLStorageClassopt CallingConvopt ReturnAttrs=ReturnAttr* RetType=Type Name=GlobalIdent '(' Params ')' UnnamedAddropt AddrSpaceopt FuncAttrs=FuncAttr* Sectionopt Comdatopt GCopt Prefixopt Prologueopt Personalityopt
;

I mainly give these specific usage examples to have a concrete grammar as a reference when deciding on either approach 1) or 2). I too would like to keep things simple. Still, the intuitive approach as a user of TextMapper would be if nonterminals belong to their respective parent, as defined in the grammar. And this would be approach 2.

Kindly, Robin

inspirer commented 6 years ago

I think we can introduce a generation option and have both, with 2 being the default. With the go rewrite, we will get declarative inlining, which should make a lot of things easier.

FuncMetadata inline
    : (MetadataAttachments=MetadataAttachment+  -> FuncMetadata)?
;
mewmew commented 6 years ago

I tried to manually inline the instruction metadata attachments in the LLVM IR grammar. It works. I get unknown node type NONE before inline, and successful resolution after inlining (i.e. I'm able to invoke Typ of CallInst).

The inlining is done by removing the InstMetadata definition, and updating each instruction to make use of Metadata=(',' MetadataAttachment)+? instead.

Old:

XorInst -> XorInst
    : 'xor' X=TypeValue ',' Y=Value InstMetadata
;

InstMetadata -> InstMetadata
   : MetadataAttachments=(',' MetadataAttachment)+?
;

New:

XorInst -> XorInst
    : 'xor' X=TypeValue ',' Y=Value Metadata=(',' MetadataAttachment)+?
;

However, one point to take into consideration is that inlining InstMetadata increased the size of the generated parsing tables considerably.

Before inlining:

$ make -C asm/ll -B
lalr: 0.223s, text: 0.517s, parser: 2167 states, 184KB

After inlining:

$ make -C asm/ll -B
lalr: 0.261s, text: 0.526s, parser: 2166 states, 225KB

So, the increase in table size from this inlining alone is 22%. Just raising this to take into consideration for the implementation of the declarative inlining in the Go version of TM.

inspirer commented 5 years ago

Yes, inlining is not always for free and I agree that we should have an option to patch locations for nullable nonterminals. I'll work on this.

mewmew commented 5 years ago

Just to add to the discussion, if approach 2 tracks the parent of each non-terminal, perhaps such a solution could also allow for non-terminals to appear at different locations within a production rule. Once more, approach 1 would favour simplicity (and speed) whereas approach 2 could favour intuitive use.

Essentially, as a user of Textmapper, I'd be fine with such trade-offs. There may of course be some fundamental aspects of how the shift/reduce parser operates that makes it not possible to do this for approach 2, if so I'd be glad to learn about them.

For a concrete example of the above that hit us today, see the following grammar:

FuncDecl -> FuncDecl
    : 'declare' Metadata=MetadataAttachment* Header=FuncHeader
    | 'define' Header=FuncHeader Metadata=MetadataAttachment* Body=FuncBody
;

Which results in the following Textmapper error:

ll.tm,966: named elements must occur in the same order in all productions

Cheers, Robin