goccmack / gocc

Parser / Scanner Generator
Other
622 stars 48 forks source link

Looking for inspiration to solve combinatorial explosion. #58

Closed mewmew closed 7 years ago

mewmew commented 7 years ago

I've recently been playing with writing a grammar for the metadata representation of the DWARF debug information of LLVM IR.

The debug data is specified as an ordered list of key-value pairs, where most keys are optional and some are required. Take DILocation for instance, which specifies the debug data of a source code location (i.e. line, column).

// ref: ParseDILocation
//
//             TAG         TYPE
// ---------------------------------
// optional:   line        uint32
// optional:   column      uint16
// REQUIRED:   scope       MDField
// optional:   inlinedAt   MDField
DILocation
    : "!DILocation" "(" "scope:" MetadataID ")"
    | "!DILocation" "(" "scope:" MetadataID "," "inlinedAt" MetadataID ")"
    | "!DILocation" "(" "column:" int_lit "," "scope:" MetadataID ")"
    | "!DILocation" "(" "column:" int_lit "," "scope:" MetadataID "," "inlinedAt" MetadataID ")"
    | "!DILocation" "(" "line:" int_lit "," "scope:" MetadataID ")"
    | "!DILocation" "(" "line:" int_lit "," "scope:" MetadataID "," "inlinedAt" MetadataID ")"
    | "!DILocation" "(" "line:" int_lit "," "column:" int_lit "," "scope:" MetadataID ")"
    | "!DILocation" "(" "line:" int_lit "," "column:" int_lit "," "scope:" MetadataID "," "inlinedAt" MetadataID ")"
;

MetadataID : "!" int_lit ;

As you can see, the above works but since we have 3 optional fields, we need 2^3=8 production rules for DILocation. This obviously doesn't scale for other debug classes, some of which contain 9 or more optional fields, thus >= 512 production rules.

Any thoughts or ideas on how one may solve this in a cleaner fashion?

Constraints: ordered list, no duplicates, and required fields must be present.

Cheers /u

mewmew commented 7 years ago

A clean definition could look something along the lines of:

DILocation
    : "!DILocation" "(" OptLine "," OptColumn "," Scope "," OptInlinedAt ")"
;

OptLine
    : empty
    | "line:" int_lit
;

OptColumn
    : empty
    | "column:" int_lit
;

Scope
    : "scope:" MetadataID
;

OptInlinedAt
    : empty
    | "inlinedAt" MetadataID
;

However, I can't figure out how to handle the separating commas. What if two optional fields in a row are not present?

mewmew commented 7 years ago

Actually, thinking more about this specific example it seems to work out. Not pretty, but works. The reasoning is to let optional fields specify their separator in the direction of their closest required field.

DILocation
    : "!DILocation" "(" OptLine OptColumn Scope OptInlinedAt ")"
;

OptLine
    : empty
    | "line:" int_lit ","
;

OptColumn
    : empty
    | "column:" int_lit ","
;

Scope
    : "scope:" MetadataID
;

OptInlinedAt
    : empty
    | "," "inlinedAt" MetadataID
;

Now for the more tricky question, how do you deal with debug classes that contain only optional fields? For a specific example, consider DISubprogram:

// optional:   scope
// optional:   name
// optional:   linkageName
// optional:   file
// optional:   line
// optional:   type
// optional:   isLocal
// optional:   isDefinition
// optional:   scopeLine
// optional:   containingType
// optional:   virtuality
// optional:   virtualIndex
// optional:   thisAdjustment
// optional:   flags
// optional:   isOptimized
// optional:   unit
// optional:   templateParams
// optional:   declaration
// optional:   variables
// optional:   thrownTypes

The only solution I can think of requires n+1 production rules. One rule per optional key, and one for empty when none of the optional keys are set. The rationale is roughly to regard a single optional key as a required field, and let all the other optional fields include the comma separation (as done in the example above). The first two rules would thus look at follows, and there would exist 19 more such rules, one where name was a fake required field, one for linkageName, and so on. Obviously this is not the prettiest of solutions, and I'd be glad to hear if there is some clean solution I have not yet considered. Any ideas?

DISubprogram
    : "!DISubprogram" "(" ")"
    | "!DISubprogram" "(" Scope OptName OptLinkageName OptFile OptLine OptType OptIsLocal OptIsDefinition OptScopeLine OptContainingType OptVirtuality OptVirtualIndex OptThisAdjustment OptFlags OptIsOptimized OptUnit OptTemplateParams OptDeclaration OptVariables OptThrownTypes ")"
    | "!DISubprogram" "(" OptScope Name OptLinkageName OptFile OptLine OptType OptIsLocal OptIsDefinition OptScopeLine OptContainingType OptVirtuality OptVirtualIndex OptThisAdjustment OptFlags OptIsOptimized OptUnit OptTemplateParams OptDeclaration OptVariables OptThrownTypes ")"
    | ...
    | "!DISubprogram" "(" OptScope OptName OptLinkageName OptFile OptLine OptType OptIsLocal OptIsDefinition OptScopeLine OptContainingType OptVirtuality OptVirtualIndex OptThisAdjustment OptFlags OptIsOptimized OptUnit OptTemplateParams OptDeclaration OptVariables ThrownTypes ")"
;

Scope
    : "scope:" Value
;

OptName
    : empty
    | "," "name:" Value
;

OptLinkageName
    : empty
    | "," "linkageName:" Value
;

OptFile
    : empty
    | "," "file:" Value
;

OptLine
    : empty
    | "," "line:" Value
;

OptType
    : empty
    | "," "type:" Value
;

OptIsLocal
    : empty
    | "," "isLocal:" Value
;

OptIsDefinition
    : empty
    | "," "isDefinition:" Value
;

OptScopeLine
    : empty
    | "," "scopeLine:" Value
;

OptContainingType
    : empty
    | "," "containingType:" Value
;

OptVirtuality
    : empty
    | "," "virtuality:" Value
;

OptVirtualIndex
    : empty
    | "," "virtualIndex:" Value
;

OptThisAdjustment
    : empty
    | "," "thisAdjustment:" Value
;

OptFlags
    : empty
    | "," "flags:" Value
;

OptIsOptimized
    : empty
    | "," "isOptimized:" Value
;

OptUnit
    : empty
    | "," "unit:" Value
;

OptTemplateParams
    : empty
    | "," "templateParams:" Value
;

OptDeclaration
    : empty
    | "," "declaration:" Value
;

OptVariables
    : empty
    | "," "variables:" Value
;

OptThrownTypes
    : empty
    | "," "thrownTypes:" Value
;

Now, for the problem I've yet to find a good, and general solution to. Any help would be much appreciated. How do you handle cases when a debug class contains two or more required fields with optional fields in between? As an example, consider:

// optional:   a
// optional:   b
// REQUIRED:   c
// optional:   d
// optional:   e
// REQUIRED:   f
// optional:   g
// optional:   h

Specifically, how do you handle the separating commas in between c and f?

mewmew commented 7 years ago

Oh, no. Even before running this I can already envision the LR(1) collisions for any debug classes where a required field is followed by two or more optional fields; because then the look-ahead will be ',' for both of the production rules, and reducing to empty or shifting to , will produce a conflict between the two optional production rules.

Oh, why :(

mewmew commented 7 years ago

Not that it is a pretty solution, but would it be difficult to specify a look-ahead of 2 for Gocc? I can imagine that the tables would be tremendously huge, thus making it unfeasible.

mewmew commented 7 years ago

Hmm, it is a bit too tempting to allow syntactically invalid input to get a clean grammar, e.g. to allow trailing commas in between the last field and the left parenthesis.

( foo , bar , baz , )

Co-incidentally, this is how to Go grammar is defined https://golang.org/ref/spec#Parameters

Parameters     = "(" [ ParameterList [ "," ] ] ")" .

Oh, how I which the LLVM IR grammar also allowed this pragmatic syntax.

mewmew commented 7 years ago

Ok, it is obviously getting late. Allowing trailing commas would not solve the issue of parsing, only make it easier to write programs. Hmm, I need some rest. Sorry for the noise, and please let me know if anyone of you have some ideas on these problems :)

Good night.

mewmew commented 7 years ago

Alas, 16 LR(1) conflicts for the following grammar:

// ref: ParseDIGlobalVariable
//
// REQ:   name           string (cannot be empty)
// opt:   scope          MDField
// opt:   linkageName    string
// opt:   file           MDField
// opt:   line           uint32
// opt:   type           MDField
// opt:   isLocal        bool
// opt:   isDefinition   bool (default true)
// opt:   declaration    MDField
// opt:   align          uint32
DIGlobalVariable
    : "!DIGlobalVariable" "(" Name OptCommaScope OptCommaLinkageName OptCommaFile OptCommaLine OptCommaType OptCommaIsLocal OptCommaIsDefinition OptCommaDeclaration OptCommaAlign ")"
;

Name
    : "name:" string_lit
;

OptCommaScope
    : empty
    | "," Scope
;

Scope
    : "scope:" MetadataID
;

OptCommaLinkageName
    : empty
    | "," LinkageName
;

LinkageName
    : "linkageName:" string_lit
;

OptCommaFile
    : empty
    | "," File
;

File
    : "file:" MetadataID
;

OptCommaLine
    : empty
    | "," Line
;

Line
    : "line:" int_lit
;

OptCommaType
    : empty
    | "," Type
;

Type
    : "type:" MetadataID
;

OptCommaIsLocal
    : empty
    | "," IsLocal
;

IsLocal
    : "isLocal:" BoolLit
;

OptCommaIsDefinition
    : empty
    | "," IsDefinition
;

IsDefinition
    : "isDefinition:" BoolLit
;

OptCommaDeclaration
    : empty
    | "," Declaration
;

Declaration
    : "declaration:" MetadataID
;

OptCommaAlign
    : empty
    | "," Align
;

Align
    : "align:" int_lit
;

What may be a better approach?

awalterschulze commented 7 years ago

I don't know if this is useful, but I made a space preserving parser with optional parenthesis and ran into some of these problems. I ended up using a "Continue" pattern, trademark pending ;)

https://github.com/katydid/katydid/blob/master/relapse/bnf/all.bnf#L276-L297

StartOr
  : OpenParen Pattern Pipe ContinueOr CloseParen <<
    &Pattern{Or: &Or{
      OpenParen: $0.(*Keyword),
      LeftPattern: $1.(*Pattern),
      Pipe: $2.(*Keyword),
      RightPattern: $3.(*Pattern),
      CloseParen: $4.(*Keyword),
    }}, nil
  >>
  ;

ContinueOr
  : Pattern
  | ContinueOr Pipe Pattern <<
    &Pattern{Or: &Or{
      LeftPattern: $0.(*Pattern),
      Pipe: $1.(*Keyword),
      RightPattern: $2.(*Pattern),
    }}, nil
  >>
  ;

I think you can use the same idea combined with your idea.

// optional:   a
// optional:   b
// REQUIRED:   c
// optional:   d
List:  "A" "," ContinueB
    | ContinueB
    ;

ContinueB: "B" "," ContinueC
   | ContinueC
   ;

ContinueC: "C" "," "D"
   | "C"
   ;

I haven't checked for LR(1) collisions, but I was able to use the "Continue" Pattern in relapse without using -a (maximal much).

mewmew commented 7 years ago

Thanks @awalterschulze! I think the Continue™️ pattern of yours would be sufficient for solving this problem. A minor update made it possible to parse the end as well,

// optional:   a
// optional:   b
// REQUIRED:   c
// optional:   d
// optional:   e
// REQUIRED:   f
// optional:   g
// optional:   h
List
    :  "A" "," ContinueB
    | ContinueB
;

ContinueB
    : "B" "," ContinueC
    | ContinueC
;

ContinueC
    : "C" "," ContinueD
;

ContinueD
    : "D" "," ContinueE
    | ContinueE
;

ContinueE
    : "E" "," ContinueF
    | ContinueF
;

ContinueF
    : "F" "," ContinueG
    | "F"
;

ContinueG
    : "G" "," "H"
    | "G"
    | "H"
;

It was an interesting problem to think about. Thanks for your input too :) In the end, for my specific use case, it turned out to be easier to solve a generalized version of the problem, rather than including dedicated grammar for each DWARF debug class, add a grammar capable of capturing their common features. As a side-effect this approach makes the parser more extendible to future additions/alterations of DWARF metadata classes.

The generalized version looks roughly as follows. Optional fields would then be checked as a production action.

SpecializedMDNode
    : MetadataName "(" MDFields ")"
;

MDFields
    : empty
    | MDFieldList
    | MDValueList
;

MDFieldList
    : MDField
    | MDFieldList "," MDField
;

MDField
    : LabelIdent MDValue
;

MDValueList
    : MDValue
    | MDValueList "," MDValue
;

MDValue
    : int_lit
    | MetadataID
    | string_lit
    | BoolLit
    | MDFlagList
    | NullConst
    | Type Value
;

MDFlagList
    : MDFlag
    | MDFlagList "|" MDFlag
;

MDFlag
    : ident
;

Closing this issue. Really glad there are rather clean solutions to what seemed like a tricky problem.

awalterschulze commented 7 years ago

Ah yes I much prefer your generalized solution.

My pleasure.

On Fri, 20 Oct 2017, 23:03 Robin Eklind, notifications@github.com wrote:

Thanks @awalterschulze https://github.com/awalterschulze! I think the Continue™️ pattern of yours would be sufficient for solving this problem. A minor update made it possible to parse parse the end as well,

// optional: a// optional: b// REQUIRED: c// optional: d// optional: e// REQUIRED: f// optional: g// optional: h

List : "A" "," ContinueB | ContinueB ;

ContinueB : "B" "," ContinueC | ContinueC ;

ContinueC : "C" "," ContinueD ;

ContinueD : "D" "," ContinueE | ContinueE ;

ContinueE : "E" "," ContinueF | ContinueF ;

ContinueF : "F" "," ContinueG | "F" ;

ContinueG : "G" "," "H" | "G" | "H" ;

It was an interesting problem to think about. Thanks for your input too :) In the end, for my specific use case, it turned out to be easier to solve a generalized version of the problem, rather than including dedicated grammar for each DWARF debug class, add a grammar capable of capturing their common features. As a side-effect this approach makes the parser more extendible to future additions/alterations of DWARF metadata classes.

The generalized version looks roughly as follows. Optional fields would then be checked as a production action.

SpecializedMDNode : MetadataName "(" MDFields ")" ;

MDFields : empty | MDFieldList | MDValueList ;

MDFieldList : MDField | MDFieldList "," MDField ;

MDField : LabelIdent MDValue ;

MDValueList : MDValue | MDValueList "," MDValue ;

MDValue : int_lit | MetadataID | string_lit | BoolLit | MDFlagList | NullConst | Type Value ;

MDFlagList : MDFlag | MDFlagList "|" MDFlag ;

MDFlag : ident ;

Closing this issue. Really glad there are rather clean solutions to what seemed like a tricky problem.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/goccmack/gocc/issues/58#issuecomment-338321722, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvsLekg9opPjJdhXceZXdGBHOxuoC2aks5suQqcgaJpZM4P_9lb .