Grammar question - Githubissues

sjhitchner commented 8 years ago

Been playing around with gocc and it's a great tool.

I have a few questions regarding how to set up the different lexical elements

I want to be able to parse out these items:

double quoted strings as string literals eg "hello world"
single quoted strings as string literals eg 'hello world'
bare strings that contain alphanumeric + '_' + '.' character as variables eg device.platform_type

I have the following lexical elements defined

_digit : '0'-'9' ;
_letter : 'a'-'z' | 'A'-'Z' ;
_alphanumeric : _letter | _digit ;
variable : _letter { _alphanumeric | '.' | '_' } ;
int_lit : _digit {_digit} ;
float_lit : _digit { '.' | _digit } ;
string_lit : '"' {.} '"' | '\'' {.} '\'';

This correctly parses double quoted strings but does not work for single quoted strings or bare strings.

What am I doing wrong?

Full grammar file for context

/* Lexical elements */
_digit : '0'-'9' ;
_letter : 'a'-'z' | 'A'-'Z' ;
_alphanumeric : _letter | _digit ;
variable : _letter { _alphanumeric | '.' | '_' } ;
int_lit : _digit {_digit} ;
float_lit : _digit { '.' | _digit } ;
string_lit : '"' {.} '"' | '\'' {.} '\'';
!whitespace : ' ' | '\t' | '\n' | '\r' ;

Expression : Expression "and" Expression      << ast.NewLogicalAnd($0, $2) >>
     | Expression "or" Expression             << ast.NewLogicalOr($0, $2) >>
     | "not" Expression                       << ast.NewNegation($1) >>
     | Term
     ;

Term : Factor ">" Factor                      << ast.NewComparisonGreaterThan($0, $2) >>
       | Factor ">=" Factor                   << ast.NewComparisonGreaterThanEquals($0, $2) >>
       | Factor "<" Factor                    << ast.NewComparisonLessThan($0, $2) >>
       | Factor "<=" Factor                   << ast.NewComparisonLessThanEquals($0, $2) >>
       | Factor "=" Factor                    << ast.NewComparisonEquals($0, $2) >>
       | Factor "==" Factor                   << ast.NewComparisonEquals($0, $2) >>
       | Factor "!=" Factor                   << ast.NewComparisonNotEquals($0, $2) >>
       | Factor "is not" Factor               << ast.NewComparisonIsNot($0, $2) >>
       | Factor "is" Factor                   << ast.NewComparisonIs($0, $2) >>
       | Factor "contains" Factor             << ast.NewComparisonContains($0, $2) >>
       | Factor "matches" Factor              << ast.NewMatches($0, $2) >>
       | Factor
       ;

Factor : int_lit                              << ast.NewLiteralInt($0) >>
       | float_lit                            << ast.NewLiteralFloat($0) >>
       | variable                             << ast.NewResolver($0) >>
       | string_lit                           << ast.NewLiteralString($0) >>  
       | "true"                               << ast.NewLiteralBool(true) >>
       | "false"                              << ast.NewLiteralBool(false) >>
       | "undefined"                          << nil, nil >>
       | "null"                               << nil, nil >>
       | "empty"                              << nil, nil >>
       | "(" Expression ")"                   << ast.NewClause($1) >>
       ;

(edited by @mewmew as the BNF grammar was otherwise distorted by Markdown formatting, resulting in panic: Error decoding rune. Lit: '', rune: 39, size1 from Gocc as '_' was replaced with '' using Markdown formatting)

mewmew commented 8 years ago

There doesn't look to be anything wrong with the lexical part of your grammar.

The following simplified grammar is capable of lexing double quoted string literals, single quoted string literals, and identifiers.

/* Lexical elements */
_digit : '0'-'9' ;
_letter : 'a'-'z' | 'A'-'Z' ;
_alphanumeric : _letter | _digit ;
variable : _letter { _alphanumeric | '.' | '_' } ;
int_lit : _digit {_digit} ;
float_lit : _digit { '.' | _digit } ;
string_lit : '"' {.} '"' | '\'' {.} '\'';
!whitespace : ' ' | '\t' | '\n' | '\r' ;

Factor
    : variable
    | string_lit
;

The following Go program (available at play.golang.org) was used to test the lexer and parser.

package main

import (
    "fmt"
    "log"
    "play/b/lexer"
    "play/b/parser"
)

func main() {
    inputs := []string{
        `"hello world"`,
        `'hello world'`,
        `hello_world`,
    }
    for _, input := range inputs {
        if err := parse(input); err != nil {
            log.Fatal(err)
        }

    }
}

func parse(input string) error {
    l := lexer.NewLexer([]byte(input))
    p := parser.NewParser()
    result, err := p.Parse(l)
    if err != nil {
        return err
    }
    fmt.Printf("%#v\n", result)
    return nil
}

The above program gives the following output for the simplifed BNF grammar.

&token.Token{Type:3, Lit:[]uint8{0x22, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x22}, Pos:token.Pos{Offset:0, Line:1, Column:15}}
&token.Token{Type:3, Lit:[]uint8{0x27, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x27}, Pos:token.Pos{Offset:0, Line:1, Column:15}}
&token.Token{Type:2, Lit:[]uint8{0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x5f, 0x77, 0x6f, 0x72, 0x6c, 0x64}, Pos:token.Pos{Offset:0, Line:1, Column:13}}

Three tokens from two distinct token types (string_lit and variable) have been lexed successfully. Which version of Gocc are you running?

sjhitchner commented 8 years ago

Finally got some time to look at this again.

I found out what was confusing me. The string_lit is parsed out correctly but the surrounding quotes are included in the string. All my tests were assuming the quote would be stripped.

Curious, is there a way to structure the grammar so the surrounding quotes are removed?

Thanks for your reply this is a fantastic library! 💃

mewmew commented 8 years ago

Curious, is there a way to structure the grammar so the surrounding quotes are removed?

Take a look at the astx example.

From astx/ast.bnf:

Stmt : 
      id               << ast.NewStmt($0) >>
;

From astx/ast/ast.go:

func NewStmt(stmtList interface{}) (Stmt, error) {
    return Stmt(stmtList.(*token.Token).Lit), nil
}

You may easily adapt the code within the semantic action expressions (i.e. within << and >>) to unquote the string literals.

Something along the lines of:

String:
    string_lit               << ast.NewStringLit($0) >>
;

func NewStringLit(lit interface{}) (string, error) {
    return strconv.Unquote(string(lit.(*token.Token).Lit)), nil
}

sjhitchner commented 8 years ago

More questions:

Does the string "empty" have significant meaning in gocc? When I try parsing the string empty I get the following error.

Unable to build AST: Error in S0: empty(15,empty), Pos(offset=0, line=1, column=7), expected one of: $ and or not > >= < <= = == != is contains matches variable string_lit int_lit float_lit null true false undefined (

If I change the parser to use the string "dempty" instead my unit tests pass.

Based on the error message above, it appears "empty" is being removed from the symbol list.

Factor 
    : variable                             << ast.NewResolver($0) >>
    | string_lit                           << ast.NewLiteralString($0) >>
    | int_lit                              << ast.NewLiteralInt($0) >>
    | float_lit                            << ast.NewLiteralFloat($0) >>
    | "empty"                              << ast.NewNull() >>
    | "null"                               << ast.NewNull() >>
    | "true"                               << ast.NewLiteralBool(true) >>
    | "false"                              << ast.NewLiteralBool(false) >>
    | "undefined"                          << ast.NewNull() >>
    | "(" Expression ")"                   << ast.NewClause($1) >>
    ;

mewmew commented 8 years ago

empty is a reserved keyword for the empty string.

From https://en.wikipedia.org/wiki/Empty_string:

In formal language theory, the empty string is the unique string of length zero.

And from the Gocc User Guide:

A syntax body may also be empty, indicated by the reserved word, empty. This allows you to write productions with optional elements, such as:
A : B "c";
B : "b" | empty;

I'd recommend reading the Gocc User Guide going forwards, it is a treasure trove of wonderful information about formal language theory and grammars.

mewmew commented 8 years ago

Note, the reserved keyword empty should be used in production rules without any surrounding quotes. That is:

// valid
A : empty ;

// invalid
B : "empty" ;

sjhitchner commented 8 years ago

Thanks for your response, I have reviewed the PDF but I was confused about the meaning. I understood that the bare string empty was a reserved word but I thought "empty" would be treated differently. I.e. "empty" and empty would have different meanings.

I am trying to build a Go implementation of a python library that has already defined a grammar that has attached meaning to the string "empty". It appears Gocc might not be able to support this.

Also today I was trying to add support for "is" and a "is not" terms. The parser appears to be confusing the "is not" term with "not" Expression.

Expression : Expression "and" Expression      << ast.NewLogicalAnd($0, $2) >>
     | Expression "or" Expression             << ast.NewLogicalOr($0, $2) >>
     | "not" Expression                       << ast.NewNegation($1) >>
     | Term
     ;

Term : Factor ">" Factor                      << ast.NewComparisonGreaterThan($0, $2) >>
       | Factor ">=" Factor                   << ast.NewComparisonGreaterThanEquals($0, $2) >>
       | Factor "<" Factor                    << ast.NewComparisonLessThan($0, $2) >>
       | Factor "<=" Factor                   << ast.NewComparisonLessThanEquals($0, $2) >>
       | Factor "=" Factor                    << ast.NewComparisonEquals($0, $2) >>
       | Factor "==" Factor                   << ast.NewComparisonEquals($0, $2) >>
       | Factor "!=" Factor                   << ast.NewComparisonNotEquals($0, $2) >>
       | Factor "is" Factor                   << ast.NewComparisonIs($0, $2) >>
       | Factor "is not" Factor               << ast.NewComparisonIsNot($0, $2) >>
       | Factor "contains" Factor             << ast.NewComparisonContains($0, $2) >>
       | Factor "matches" Factor              << ast.NewMatches($0, $2) >>
       | Factor
       ;

Factor 
    : variable                             << ast.NewResolver($0) >>
    | string_lit                           << ast.NewLiteralString($0) >>
    | int_lit                              << ast.NewLiteralInt($0) >>
    | float_lit                            << ast.NewLiteralFloat($0) >>
    | "empty"                              << ast.NewNull() >>
    | "null"                               << ast.NewNull() >>
    | "true"                               << ast.NewLiteralBool(true) >>
    | "false"                              << ast.NewLiteralBool(false) >>
    | "undefined"                          << ast.NewNull() >>
    | "(" Expression ")"                   << ast.NewClause($1) >>
    ;

awalterschulze commented 8 years ago

Why not have

 Factor "is" "not" Factor               << ast.NewComparisonIsNot($0, $2) >>

awalterschulze commented 8 years ago

I am still thinking about empty

awalterschulze commented 8 years ago

Have you tried putting empty in the lexer as

emptyword : 'e' 'm' 'p' 't' 'y' ;

And then use that in the grammar

Factor
  ...
  | emptyword                              << ast.NewNull() >>
  ...
  ;

I would like to know whether that also breaks?

mewmew commented 8 years ago

Why not have

Factor "is" "not" Factor               << ast.NewComparisonIsNot($0, $2) >>

Another option would be to use

Factor "is" Expression                    << ast.NewComparison($0, $2) >>

Since Expression already includes "not" Expression among others.

sangisos commented 7 years ago

@mewmew Wouldn't that give possibility to have a parse conflict with the AST:

      Factor "is" Expression
      /        |             \
 Term1       "is"  Expression "and" Expression
                       /        |       \
                    Term2     "and"     Term3

and

               Expression "and" Expression
              /             |       \
  Factor "is" Expression  "and"     Term3
   /       |      \
Term1    "is"     Term2

? The not in "is not" is not the same as the not in "not" Expression.

goccmack / gocc

Grammar question #29