timtadh / lexmachine

Lex machinery for Go.

DFA Backend #7

Closed timtadh closed 6 years ago

timtadh commented 6 years ago

This is a new approach to creating a DFA backend for lexmachine. It generates the DFA directly from the regular expression after desugaring the Range AST nodes into Alternations of Characters. This approach creates a smaller initial DFA than generating one from the NFA byte code, while also dealing correctly with Range transitions. Ranges have to be desugared or handled with care; otherwise, incorrect DFAs will be generated when there are overlapping ranges which (should) transition to the same DFA state. This implementation is based on Algorithm 3.36 in the "Dragon Book."
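To make the desugaring step concrete, here is a minimal sketch; Character, Alternation, and Range are hypothetical stand-ins for illustration, not lexmachine's actual AST types:

package main

import "fmt"

// Hypothetical AST node types, for illustration only.
type Node interface{ String() string }

type Character struct{ Ch byte }
type Alternation struct{ Left, Right Node }
type Range struct{ From, To byte }

func (c Character) String() string   { return string(c.Ch) }
func (a Alternation) String() string { return "(" + a.Left.String() + "|" + a.Right.String() + ")" }
func (r Range) String() string       { return fmt.Sprintf("[%c-%c]", r.From, r.To) }

// desugarRange rewrites [from-to] as (from|from+1|...|to), so the DFA
// construction only ever sees single-character transitions and overlapping
// ranges cannot produce conflicting transitions out of the same state.
func desugarRange(r Range) Node {
	var n Node = Character{r.From}
	for ch := int(r.From) + 1; ch <= int(r.To); ch++ {
		n = Alternation{Left: n, Right: Character{byte(ch)}}
	}
	return n
}

func main() {
	fmt.Println(desugarRange(Range{'a', 'd'})) // prints (((a|b)|c)|d)
}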

timtadh commented 6 years ago

@omegascorp

Any chance you want to test the new DFA backend? Just replace lexer.Compile() with lexer.CompileDFA() and it should work. The DFA backend provides O(n) matching at a higher memory and initial compilation cost. It works best if you compile a lexer once and then reuse it over and over on different strings.
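For concreteness, a minimal sketch of the swap; Compile and CompileDFA are the real lexmachine entry points being compared, while the pattern, token id, and the newDFALexer name are placeholders:

import (
	"github.com/timtadh/lexmachine"
	"github.com/timtadh/lexmachine/machines"
)

func newDFALexer() (*lexmachine.Lexer, error) {
	lexer := lexmachine.NewLexer()
	// placeholder pattern and token id
	lexer.Add([]byte("[a-z]+"), func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
		return s.Token(0, string(m.Bytes), m), nil
	})
	// previously: if err := lexer.Compile(); err != nil { ... }
	if err := lexer.CompileDFA(); err != nil {
		return nil, err
	}
	return lexer, nil
}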

omegascorp commented 6 years ago

@timtadh I tried to replace lexer.Compile() with lexer.CompileDFA() and got the error below. If it helps, I can also provide the list of the regexes I'm adding to the lexer.

panic: Access out of bounds. len(*List) = 0, idx = 0
goroutine 1 [running]:
github.com/timtadh/data-structures/errors.Errorf(0x5a635c, 0x2f, 0xc420059200, 0x2, 0x2, 0x675980, 0x0)
    /home/omegascorp/gocode/src/github.com/timtadh/data-structures/errors/errors.go:23 +0x82
github.com/timtadh/data-structures/list.(*List).Get(0xc4206d03f0, 0x0, 0x65fc80, 0xc4206d03f0, 0x0, 0x0)
    /home/omegascorp/gocode/src/github.com/timtadh/data-structures/list/array_list.go:285 +0xfa
github.com/timtadh/data-structures/list.(*Sorted).Get(0xc4206d03f0, 0x0, 0x65fc80, 0xc4206d03f0, 0x0, 0x0)
    /home/omegascorp/gocode/src/github.com/timtadh/data-structures/list/sorted.go:135 +0x37
github.com/timtadh/lexmachine/dfa.(*DFA).minimize(0xc4200820f0, 0x0)
    /home/omegascorp/gocode/src/github.com/timtadh/lexmachine/dfa/gen.go:309 +0x13a1
github.com/timtadh/lexmachine/dfa.Generate(0x65fd00, 0xc4200bd900, 0xc4200bd900)
    /home/omegascorp/gocode/src/github.com/timtadh/lexmachine/dfa/gen.go:130 +0x1b87
github.com/timtadh/lexmachine.(*Lexer).CompileDFA(0xc420082c80, 0xc4200961d8, 0xc42005be08)
    /home/omegascorp/gocode/src/github.com/timtadh/lexmachine/lexer.go:302 +0x7d
codetrace-analyzer/services/lexerService.getLexer(0xc4200128a0, 0xa, 0x500000000000000, 0xffffffffffffffff, 0x1f)
    /home/omegascorp/gocode/src/codetrace-analyzer/services/lexerService/lexerService.go:44 +0x66
timtadh commented 6 years ago

@omegascorp That would be great! Thank you for testing!!

omegascorp commented 6 years ago

@timtadh here it is:

var lexer *lexmachine.Lexer = lexmachine.NewLexer()
lexer.Add([]byte("import|require"), getToken(TokenTypeInclude))
lexer.Add([]byte("\"[^\\\"]*\"|'[^']*'|`[^`]*`"), getToken(TokenTypeString))
lexer.Add([]byte("//[^\n]*\n?|/\\*([^*]|\r|\n|(\\*+([^*/]|\r|\n)))*\\*+/"), getToken(TokenTypeComment))
lexer.Add([]byte("[A-Za-z$][A-Za-z0-9$]+"), getToken(TokenTypeIdentifier))
lexer.Add([]byte("function"), getToken(TokenTypeFunction))
lexer.Add([]byte("class"), getToken(TokenTypeClass))
lexer.Add([]byte(">=|<=|=|>|<|\\|\\||&&"), getToken(TokenTypeOperator))

Also, here is my modification of the token method from your example:

import (
	"github.com/timtadh/lexmachine"
	"github.com/timtadh/lexmachine/machines"
)

const (
	TokenTypeInclude int = iota
	TokenTypeComment
	TokenTypeString
	TokenTypeIdentifier
	TokenTypeFunction
	TokenTypeClass
	TokenTypeOperator
)

func getToken(tokenType int) lexmachine.Action {
	return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
		return s.Token(tokenType, string(m.Bytes), m), nil
	}
}
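For completeness, a hypothetical usage sketch of scanning input with the lexer defined above; Scanner and Next are lexmachine's real API, while the input is made up (and kept to a single token, since this lexer has no whitespace rule):

scanner, err := lexer.Scanner([]byte("import"))
if err != nil {
	panic(err)
}
for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
	if err != nil {
		panic(err)
	}
	fmt.Println(tok.(*lexmachine.Token)) // the TokenTypeInclude match for "import"
}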
timtadh commented 6 years ago

@omegascorp Great, I reproduced the error. I believe I know the root cause.

timtadh commented 6 years ago

@omegascorp should be fixed.

omegascorp commented 6 years ago

@timtadh Yeah, it worked, but it looks like something is wrong with err.FailTC: it contains the previous tc instead of the current one (the value is the same as scanner.TC).
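For context, an illustrative sketch of how that field surfaces; machines.UnconsumedInput, its FailTC field, and scanner.TC are the real lexmachine names discussed here, while the error handling itself is hypothetical:

for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
	if err != nil {
		if ui, ok := err.(*machines.UnconsumedInput); ok {
			// FailTC should be the text counter where matching failed,
			// not the start of the previously matched token.
			fmt.Printf("no match at tc=%d (scanner.TC=%d)\n", ui.FailTC, scanner.TC)
		}
		break
	}
	fmt.Println(tok)
}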

timtadh commented 6 years ago

@omegascorp I fixed that as well. Thanks for catching the differences. (And thank you for testing!!)

timtadh commented 6 years ago

@omegascorp one note: you will want to put your function and class tokens before your identifier token, otherwise the keywords will get tokenized as identifiers:

    lexer.Add([]byte("import|require"), getToken(TokenTypeInclude))
    lexer.Add([]byte("function"), getToken(TokenTypeFunction))
    lexer.Add([]byte("class"), getToken(TokenTypeClass))
    lexer.Add([]byte("\"[^\\\"]*\"|'[^']*'|`[^`]*`"), getToken(TokenTypeString))
    lexer.Add([]byte("//[^\n]*\n?|/\\*([^*]|\r|\n|(\\*+([^*/]|\r|\n)))*\\*+/"), getToken(TokenTypeComment))
    lexer.Add([]byte("[A-Za-z$][A-Za-z0-9$]+"), getToken(TokenTypeIdentifier))
    lexer.Add([]byte(">=|<=|=|>|<|\\|\\||&&"), getToken(TokenTypeOperator))
omegascorp commented 6 years ago

@timtadh thanks for noticing!

By the way, the DFA really is much faster. Here are my performance tests (for my whole algorithm, not only lexmachine).

For lexer.Compile():

Time:  2.070578529s
Heap Alloc:  3766664
Heap Sys:  6782976
Total:  31248992

For lexer.CompileDFA():

Time:  714.634279ms
Heap Alloc:  1129992
Heap Sys:  6848512
Total:  49272816
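A minimal sketch of how numbers like these might be collected, assuming they come from time.Since and runtime.MemStats (the labels match the HeapAlloc, HeapSys, and TotalAlloc fields, which are reported in bytes); the measured function is a placeholder:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// measure runs f, then prints elapsed wall time and heap statistics.
func measure(name string, f func()) {
	start := time.Now()
	f()
	elapsed := time.Since(start)

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println(name)
	fmt.Println("Time: ", elapsed)
	fmt.Println("Heap Alloc: ", m.HeapAlloc) // bytes of live heap objects
	fmt.Println("Heap Sys: ", m.HeapSys)     // bytes of heap memory from the OS
	fmt.Println("Total: ", m.TotalAlloc)     // cumulative bytes allocated
}

func main() {
	measure("For lexer.CompileDFA():", func() {
		// placeholder: build the lexer, compile it, and scan the inputs here
	})
}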
timtadh commented 6 years ago

@omegascorp awesome. I am going to merge this and maybe tag a new release in a bit. I want to make a pass on the documentation as well.