Closed: timtadh closed this issue 6 years ago.
@omegascorp Any chance you want to test the new DFA backend? Just replace `lexer.Compile()` with `lexer.CompileDFA()` and it should work. The DFA backend provides O(n) matching at a higher memory and initial compilation cost. It pays off if you generate a lexer once and then reuse it over and over on different strings.
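For intuition on the O(n) claim: a DFA consumes each input byte exactly once with a single table lookup, so matching time is linear in the input length no matter how complex the original regular expression was. This is a minimal standalone sketch of table-driven DFA matching, not lexmachine's actual implementation:

```go
package main

import "fmt"

// matchDFA runs a table-driven DFA over the input, starting in state 0.
// Each byte causes exactly one transition, so matching is O(len(input)).
func matchDFA(trans map[int]map[byte]int, accepting map[int]bool, input string) bool {
	state := 0
	for i := 0; i < len(input); i++ {
		next, ok := trans[state][input[i]]
		if !ok {
			return false // no transition for this byte: reject
		}
		state = next
	}
	return accepting[state]
}

func main() {
	// Hand-built DFA for the regex "ab+" (an 'a' followed by one or more 'b's).
	trans := map[int]map[byte]int{
		0: {'a': 1},
		1: {'b': 2},
		2: {'b': 2},
	}
	accepting := map[int]bool{2: true}

	fmt.Println(matchDFA(trans, accepting, "abbb")) // true
	fmt.Println(matchDFA(trans, accepting, "ac"))   // false
}
```

The trade-off mentioned above shows up in the table: the transition table must be built up front (compilation cost) and held in memory, but after that each input byte is a single map lookup.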
@timtadh I tried replacing `lexer.Compile()` with `lexer.CompileDFA()` and got an error. If it helps, I can also provide the list of regexes I'm adding to the lexer.
```
panic: Access out of bounds. len(*List) = 0, idx = 0

goroutine 1 [running]:
github.com/timtadh/data-structures/errors.Errorf(0x5a635c, 0x2f, 0xc420059200, 0x2, 0x2, 0x675980, 0x0)
	/home/omegascorp/gocode/src/github.com/timtadh/data-structures/errors/errors.go:23 +0x82
github.com/timtadh/data-structures/list.(*List).Get(0xc4206d03f0, 0x0, 0x65fc80, 0xc4206d03f0, 0x0, 0x0)
	/home/omegascorp/gocode/src/github.com/timtadh/data-structures/list/array_list.go:285 +0xfa
github.com/timtadh/data-structures/list.(*Sorted).Get(0xc4206d03f0, 0x0, 0x65fc80, 0xc4206d03f0, 0x0, 0x0)
	/home/omegascorp/gocode/src/github.com/timtadh/data-structures/list/sorted.go:135 +0x37
github.com/timtadh/lexmachine/dfa.(*DFA).minimize(0xc4200820f0, 0x0)
	/home/omegascorp/gocode/src/github.com/timtadh/lexmachine/dfa/gen.go:309 +0x13a1
github.com/timtadh/lexmachine/dfa.Generate(0x65fd00, 0xc4200bd900, 0xc4200bd900)
	/home/omegascorp/gocode/src/github.com/timtadh/lexmachine/dfa/gen.go:130 +0x1b87
github.com/timtadh/lexmachine.(*Lexer).CompileDFA(0xc420082c80, 0xc4200961d8, 0xc42005be08)
	/home/omegascorp/gocode/src/github.com/timtadh/lexmachine/lexer.go:302 +0x7d
codetrace-analyzer/services/lexerService.getLexer(0xc4200128a0, 0xa, 0x500000000000000, 0xffffffffffffffff, 0x1f)
	/home/omegascorp/gocode/src/codetrace-analyzer/services/lexerService/lexerService.go:44 +0x66
```
@omegascorp That would be great! Thank you for testing!!
@timtadh here it is:
```go
var lexer *lexmachine.Lexer = lexmachine.NewLexer()
lexer.Add([]byte("import|require"), getToken(TokenTypeInclude))
lexer.Add([]byte("\"[^\\\"]*\"|'[^']*'|`[^`]*`"), getToken(TokenTypeString))
lexer.Add([]byte("//[^\n]*\n?|/\\*([^*]|\r|\n|(\\*+([^*/]|\r|\n)))*\\*+/"), getToken(TokenTypeComment))
lexer.Add([]byte("[A-Za-z$][A-Za-z0-9$]+"), getToken(TokenTypeIdentifier))
lexer.Add([]byte("function"), getToken(TokenTypeFunction))
lexer.Add([]byte("class"), getToken(TokenTypeClass))
lexer.Add([]byte(">=|<=|=|>|<|\\|\\||&&"), getToken(TokenTypeOperator))
```
Also, here is my modification of the `token` method from your example:
```go
const (
	TokenTypeInclude int = iota
	TokenTypeComment
	TokenTypeString
	TokenTypeIdentifier
	TokenTypeFunction
	TokenTypeClass
	TokenTypeOperator
)

func getToken(tokenType int) lexmachine.Action {
	return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
		return s.Token(tokenType, string(m.Bytes), m), nil
	}
}
```
@omegascorp Great, I reproduced the error. I believe I know the root cause.
@omegascorp should be fixed.
@timtadh Yeah, it worked, but it looks like something is wrong with `err.FailTC`: it contains the previous `tc` instead of the current one; the value is the same as `scanner.TC`.
@omegascorp I fixed that as well. Thanks for catching the differences. (And thank you for testing!!)
@omegascorp One note: you will want to put your `function` and `class` tokens before your `identifier` token, otherwise the keywords will get tokenized as identifiers:
```go
lexer.Add([]byte("import|require"), getToken(TokenTypeInclude))
lexer.Add([]byte("function"), getToken(TokenTypeFunction))
lexer.Add([]byte("class"), getToken(TokenTypeClass))
lexer.Add([]byte("\"[^\\\"]*\"|'[^']*'|`[^`]*`"), getToken(TokenTypeString))
lexer.Add([]byte("//[^\n]*\n?|/\\*([^*]|\r|\n|(\\*+([^*/]|\r|\n)))*\\*+/"), getToken(TokenTypeComment))
lexer.Add([]byte("[A-Za-z$][A-Za-z0-9$]+"), getToken(TokenTypeIdentifier))
lexer.Add([]byte(">=|<=|=|>|<|\\|\\||&&"), getToken(TokenTypeOperator))
```
@timtadh thanks for noticing!
By the way, the DFA is really much faster. Here are my performance tests (for my whole algorithm, not only lexmachine):
For `lexer.Compile()`:

```
Time: 2.070578529s
Heap Alloc: 3766664
Heap Sys: 6782976
Total: 31248992
```

For `lexer.CompileDFA()`:

```
Time: 714.634279ms
Heap Alloc: 1129992
Heap Sys: 6848512
Total: 49272816
```
@omegascorp Awesome. I am going to merge this and maybe tag a new release in a bit. I also want to make a pass on the documentation.
This is a new approach to building a DFA backend for lexmachine. It generates the DFA directly from the regular expression after desugaring Range AST nodes into Alternations of Characters. This approach creates a smaller initial DFA than generation from the NFA byte code while also handling Range transitions correctly. Ranges must be desugared or handled with care; otherwise, incorrect DFAs will be generated when there are overlapping ranges which (should) transition to the same DFA state. This implementation is based on Algorithm 3.36 in the "Dragon Book."
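To illustrate the desugaring step described above, here is a minimal sketch (my own illustration, not lexmachine's code) of rewriting a character range into an explicit alternation of single characters, which is what lets the direct construction avoid the overlapping-range pitfalls:

```go
package main

import (
	"fmt"
	"strings"
)

// desugarRange rewrites a character range [lo-hi] into an explicit
// alternation of single characters, e.g. [a-d] -> (a|b|c|d). It assumes
// lo <= hi. Once every transition is over a plain character, two rules
// whose ranges overlap naturally share transitions on the common
// characters, so the generated DFA states merge correctly.
func desugarRange(lo, hi byte) string {
	var alts []string
	for c := lo; ; c++ {
		alts = append(alts, string(c))
		if c == hi {
			break // checked after appending so hi is included
		}
	}
	return "(" + strings.Join(alts, "|") + ")"
}

func main() {
	fmt.Println(desugarRange('a', 'd')) // (a|b|c|d)
	fmt.Println(desugarRange('0', '3')) // (0|1|2|3)
}
```

The cost of this rewrite is a larger AST (a 256-character range becomes a 256-way alternation), which is part of the higher compilation cost mentioned earlier; the benefit is that the subset-construction bookkeeping only ever sees single-character transitions.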