sourcegraph / gophercon-2018-liveblog

Documents how to write a great liveblog post and how to submit your post for the GopherCon 2018 liveblog hosted by Sourcegraph at https://sourcegraph.com/gophercon
1 stars 2 forks source link

How to Write a Parser in Go #19

Closed nicksnyder closed 6 years ago

nicksnyder commented 6 years ago

Presenter: Sugu Sougoumarane

Liveblogger: Nick Snyder

Sugu Sougoumarane is the co-creator of Vitess, which contains a SQL parser that is now used by many other projects. In this talk he demonstrates how to write a parser using goyacc.

How to Write a Parser in Go

Summary


Parser use cases

How to write a parser in two easy steps

image

goyacc

Why goyacc

Why not goyacc

Using goyacc

General steps:

  1. Express grammar.
  2. Grammar rules can return values and assign types.
  3. Write code to return values.
  4. Write lexical analyzer.

Goyacc is almost an exact translation of the original yacc so some of the idiosyncrasies have been inherited. For example, C programs return only 1 value: 0 for success and 1 for failure. This means you need awkward boilerplate to give values to the lexer:

%{
package jsonparser
func setResult(l yyLexer, v Result) {
  l.(*lex).result = v
}
%}
%union{
}
%start main
%%
main:
  {
    setResult(yylex, 0)
  }

Use go generate to create the actual parser.go file:

//go:generate goyacc -l -o parser.go parser.y

How to parse a phone number

Area code has three parts: area code, first part, second part.

%token D

phone:
  area part1 part2
| area '-' part1 '-' part2
area: D D D
part1: D D D
part2: D D D D

Captital letters signify tokens.

How to return values

The generated parser is just a single function that runs a state machine and uses local variables.

These variables are saved in a union data structure:

%union{
  result Result
  part string
  ch byte
}

%type <result> phone
%type <part> area part1 part2

Actions run Go code (i.e. everything inside the braces) when a rule matches. Dollar variables address a variable that is a value returned by the parser.

part2: D D D D
  {
    $$ = cat($1, $2, $3, $4)
  }

Lexing

Two things are happening concurrently during lexing:

  1. Rules are getting matched. The int returned determines which rules match.
  2. Code is getting executed depending on which rule is matched. The result value is used inside the code you write.

Sometimes lex can return the byte itself as an int. Yacc has builtin predetermined tokens so all first 127 bytes are reserved and can be returned without telling the parser you are returning them

b := l.nextb()
if unicode.IsDigit(rune(b)) {
    lval.ch = b
    return D
}
return b

How does it work?

image

Goyacc

  1. Generates a const string per symbol
  2. Defines an interface for the Lexer
  3. Defines a parse function that accepts the lex as input

Options for lexing

Lex is not code that you live in. It is code you write once and then use for a long time. Ok if the code is not clean.

Future improvements

For complicated grammars (e.g. SQL), Goyacc can generate a big result structure that is expensive to pass around. Goyacc actually assigns this structure every time there is a state transition.

C (yacc) has structure called union which efficiently packs the datastructre, but there is no equivalent in Go....except interfaces are a very close equivalent!

Unlike C union type, you can type assert an interface in Go. One limitation with using a type asserted Go interface is that it is an rvalue which means you can't assign to it.

Switching Vitess to use an interface instead of struct doubles performance, but would be a backward incompatible change to goyacc.

sougou commented 6 years ago

Thanks for doing this :).

Could you add the following as additional info:

The examples in the presentation can be found here: https://github.com/sougou/parser_tutorial. Please refer to vitess.io for more information on Vitess. PlanetScale provides enterprise support for Vitess. For more information, please visit planetscale.com.

nicksnyder commented 6 years ago

cc @ryan-blunden @attfarhan because I am about to hop on a plane

(the current post does link to Vitess in the byline and the code samples in the summary, but we could add that at the bottom)