Memory leaks when parsing Chinese

picoe / Eto.Parse

Recursive descent LL(k) parser for .NET with Fluent API, BNF, EBNF and Gold Grammars

MIT License


Memory leaks when parsing Chinese #48

Closed by fawdlstty 3 years ago

fawdlstty commented 3 years ago

var _grammar = new EbnfGrammar (EbnfStyle.W3c).Build ("id ::= [a-zA-Z\u0100-\uffff_][0-9a-zA-Z\u0100-\uffff_]*", "id");
var _match = _grammar.Match ("张三李四");

fawdlstty commented 3 years ago

Could more examples be added, particularly for the EBNF parser methods?

fawdlstty commented 3 years ago

var _grammar = new EbnfGrammar (EbnfStyle.W3c).Build (@"
/* base */
id                  ::= [a-zA-Z_][0-9a-zA-Z_]*
s                   ::= [   
]+

svar                ::= id
svar_op0_expr       ::= '(' svar_expr ')'
svar_op1_expr       ::= (('++'|'--') svar_expr) | (svar_expr ('++'|'--'))
svar_expr           ::= s? (svar | svar_op0_expr | svar_op1_expr) s?
", "svar_expr");
var _match = _grammar.Match (" ++ abc");

This code causes endless recursion and then a stack overflow.

cwensley commented 3 years ago

Thanks for reporting the issue! Do you know what is leaking? Are you reusing the grammar or building it every time?

cwensley commented 3 years ago

Also, it's interesting that you are using this for non-English languages. I've used the invariant versions of methods (e.g. char.ToLowerInvariant()) for case-insensitive matching because they are faster. Is that something that needs fixing for your use case?

cwensley commented 3 years ago

Ah, I think this is less a memory leak than a memory hog. One of the optimizations converts character ranges into a dictionary lookup; it should limit how large a range can be before applying that optimization.
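To illustrate why a wide range is costly once it is expanded into a lookup, here is a minimal, self-contained sketch. It does not use Eto.Parse and is not its actual implementation: `RangeLookupDemo`, `InRange`, and `BuildLookup` are hypothetical names, and a `HashSet<char>` stands in for the dictionary mentioned above. It compares a plain range test with a per-character lookup built from the same range.

```csharp
using System;
using System.Collections.Generic;

public static class RangeLookupDemo
{
    // Range test: two comparisons, constant memory however wide the range is.
    public static bool InRange(char c, char low, char high) => c >= low && c <= high;

    // Expanded lookup (illustrative stand-in): one entry per character in the range.
    public static HashSet<char> BuildLookup(char low, char high)
    {
        var set = new HashSet<char>();
        // Iterate as int so the loop terminates when high is '\uffff'.
        for (int c = low; c <= high; c++)
            set.Add((char)c);
        return set;
    }

    public static void Main()
    {
        var ascii = BuildLookup('a', 'z');
        var wide = BuildLookup('\u0100', '\uffff');

        Console.WriteLine("a-z entries: " + ascii.Count);
        Console.WriteLine("U+0100 to U+FFFF entries: " + wide.Count);

        // Both approaches answer the same membership question.
        Console.WriteLine(InRange('张', '\u0100', '\uffff'));
        Console.WriteLine(wide.Contains('张'));
    }
}
```

The `a-z` range expands to 26 entries, while `\u0100-\uffff` expands to 65,280. If a grammar builds such a table for each character set, a range that covers most of the Basic Multilingual Plane multiplies that cost, which fits the "memory hog" description.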

cwensley commented 3 years ago

One way to get around this for now is to turn off the GrammarOptimizations.CharacterSetAlternations optimization, for example:

myGrammar.Optimizations = GrammarOptimizations.All & ~GrammarOptimizations.CharacterSetAlternations;
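Applied to the grammar from the original report, the workaround might look like the sketch below. It makes several assumptions: that the flags enum is `GrammarOptimizations` (as named in the comment above), that `EbnfGrammar` is in the `Eto.Parse.Grammars` namespace, that the property only takes effect if it is set before the first `Match` call, and that the match result exposes a `Success` property. `WorkaroundSketch` is a hypothetical name.

```csharp
using System;
using Eto.Parse;
using Eto.Parse.Grammars;

public static class WorkaroundSketch
{
    public static void Main()
    {
        // Same grammar as in the original report.
        var grammar = new EbnfGrammar(EbnfStyle.W3c).Build(
            "id ::= [a-zA-Z\u0100-\uffff_][0-9a-zA-Z\u0100-\uffff_]*", "id");

        // Assumption: this must run before the first Match so that the
        // grammar is initialized without the character-set optimization.
        grammar.Optimizations =
            GrammarOptimizations.All & ~GrammarOptimizations.CharacterSetAlternations;

        var match = grammar.Match("张三李四");
        Console.WriteLine(match.Success);
    }
}
```

Building the grammar once and reusing it for every input, rather than rebuilding it per call, should also keep any remaining setup cost from repeating.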