usethesource / rascal

The implementation of the Rascal meta-programming language (including interpreter, type checker, parser generator, compiler and JVM based run-time system)
http://www.rascal-mpl.org
Other
406 stars 78 forks source link

Rascal's parser very slow on big files #866

Closed wnederhof closed 8 years ago

wnederhof commented 9 years ago

Parsing big files using Rascal's syntax-definitions does not work properly as it will be constantly loading (I tried to parse a file of around 4MB). Furthermore, I tried to load a (generated) 240KB Rascal file through an import-declaration, which also hung. After loading Rascal in debug mode, I found out that Rascal was still parsing after 45 minutes. I think this may somehow be caused by Rascal's backtracking facilities, as "Execute Selected Text in Console" works almost instantly, in which one statement is executed at a time (probably the console uses some kind of pre-parsing technique).

jurgenvinju commented 9 years ago

Hmmm. need more details. It has nothing to do with backtracking (Rascal uses a form of GLL parsing which does not backtrack, but rather predicts all alternatives "in parallel" and shares common derivation paths).

The difference between the console and an import is interesting. Both scenarios use the same parsing technique. It may be as simple as running out of memory in one scenario and not in the other; the wait time could be generated by the GC running like crazy?

wnederhof commented 9 years ago

One thing I found is that during debug mode, the function liftRec in the ASTBuilder is called recursively uncountable times by itself. If you like, I can send you the file I was trying to load per e-mail?

jurgenvinju commented 9 years ago

yes please do send it! liftRec is a recursive function over concrete syntax patterns, so it's normal to be called a lot in files with a lot of patterns. In any case it's not the parser that's bothering you then, since this is done some time after parsing.

DavyLandman commented 9 years ago

In general, debug mode is slower. So for speed always use release mode (we have some commits in /unstable/ that do improve performance of debug mode)

Next to sending the file, have you tried increasing the memory (-Xmx flag)? Java will fight very long before admitting it needs more memory. One of the signs is that your java process is suddenly consuming all your cores. (Rascal is quite single threaded for now)

wnederhof commented 9 years ago

I set the -Xmx flag in eclipse.ini to 2G and also in the Run settings for the Eclipse Application (as I run Rascal from source). Usually I do not start Rascal from Debug mode. top shows me that Java is using around 100-105% of my CPU (I assume this means it uses only one full core and a bit). Memory is around 40%, but is at 30% when idle. My system (Linux Mint 17.2) does not freeze up when importing the module, which it sometimes does when a program uses all of the available memory. Perhaps irrelevant, but I have a Lenovo X220 i7-2640M 2.8GHZ (dual core) with 4GB memory and 4GB of swap memory on my SSD. Hope this helps! :-)

tvdstorm commented 9 years ago

What is the grammar?

wnederhof commented 9 years ago
layout Whitespace   = [\t-\n\r\ ]*;
lexical Id          = [a-zA-Z][0-9a-zA-Z_]*;
lexical Wsp     = [\ \n\r];
lexical Comment = ("//" [^\n\r]*) | ("/*" [^*/] "*/") ;
lexical WS_     = (Wsp | Comment)*;
lexical WS__        = (Wsp | Comment)+;
lexical Id          = [a-zA-Z][0-9a-zA-Z_]*;
lexical ShortId = [a-zA-Z];
lexical LongId  = Id;

syntax AndExp = and: AndExp WS_ "&" WS_ NotExp ;
syntax AndExp = notexp: NotExp ;
syntax Digit = eight: "8" ;
syntax Digit = five: "5" ;
syntax Digit = four: "4" ;
syntax Digit = nine: "9" ;
syntax Digit = \one: "1" ;
syntax Digit = seven: "7" ;
syntax Digit = six: "6" ;
syntax Digit = three: "3" ;
syntax Digit = two: "2" ;
syntax Digit = zero: "0" ;
syntax Exp = letexp: WS_ LetExp WS_ ;
syntax FacNoPar = app1: FacNoPar WS__ Prim ;
syntax FacNoPar = app2: FacParen WS_ Prim ;
syntax FacNoPar = car: "car" WS__ Prim ;
syntax FacNoPar = cdr: "cdr" WS__ Prim ;
syntax FacNoPar = isbool: "boolean?" WS__ Prim ;
syntax FacNoPar = isint: "integer?" WS__ Prim ;
syntax FacNoPar = isneg: "negative?" WS__ Prim ;
syntax FacNoPar = isnull: "null?" WS__ Prim ;
syntax FacNoPar = ispair: "pair?" WS__ Prim ;
syntax FacNoPar = ispos: "positive?" WS__ Prim ;
syntax FacNoPar = isproc: "procedure?" WS__ Prim ;
syntax FacNoPar = iszero: "zero?" WS__ Prim ;
syntax FacNoPar = pred: "--" WS_ Prim ;
syntax FacNoPar = prim: Prim ;
syntax FacNoPar = succ: "++" WS_ Prim ;
syntax FacParen = app: Factor WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = car: "car" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = cdr: "cdr" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = cons: "cons" WS_ "(" WS_ LetExp WS_ "," WS_ LetExp WS_ ")" ;
syntax FacParen = isbool: "boolean?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = isint: "integer?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = isneg: "negative?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = isnull: "null?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = ispair: "pair?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = ispos: "positive?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = isproc: "procedure?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = iszero: "zero?" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = paren: "(" WS_ LetExp WS_ ")" ;
syntax FacParen = pred: "--" WS_ "(" WS_ LetExp WS_ ")" ;
syntax FacParen = succ: "++" WS_ "(" WS_ LetExp WS_ ")" ;
syntax Factor = facparen: FacParen ;
syntax Factor = noparen: FacNoPar ;
syntax IntConst = digit: Digit ;
syntax IntConst = more: IntConst Digit ;
syntax LetExp = \if: OrExp WS_ "?" WS_ LetExp WS_ ":" WS_ LetExp ;
syntax LetExp = letfun: "let" WS__ Id WS_ "(" WS_ Id WS_ ")" WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp ;
syntax LetExp = letrec: "letrec" WS__ Id WS_ "(" WS_ Id WS_ ")" WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp ;
syntax LetExp = letvar: "let" WS__ Id WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp ;
syntax LetExp = orexp: OrExp ;
syntax NotExp = not: "!" WS_ RelExp ;
syntax NotExp = relexp: RelExp ;
syntax OrExp = andexp: AndExp ;
syntax OrExp = or: OrExp WS_ "|" WS_ AndExp ;
syntax OrExp = xor: OrExp WS_ "^" WS_ AndExp ;
syntax Prim = \false: "#f" ;
syntax Prim = intConst: IntConst ;
syntax Prim = null: "#null" ;
syntax Prim = \true: "#t" ;
syntax Prim = var: Id ;
syntax RelExp = eq: SimpleExp WS_ "=" WS_ SimpleExp ;
syntax RelExp = gt: SimpleExp WS_ "\>" WS_ SimpleExp ;
syntax RelExp = gte: SimpleExp WS_ "\>=" WS_ SimpleExp ;
syntax RelExp = lt: SimpleExp WS_ "\<" WS_ SimpleExp ;
syntax RelExp = lte: SimpleExp WS_ "\<=" WS_ SimpleExp ;
syntax RelExp = neq: SimpleExp WS_ "!=" WS_ SimpleExp ;
syntax RelExp = simple: SimpleExp ;
syntax SimpleExp = add: SimpleExp WS_ "+" WS_ Term ;
syntax SimpleExp = sub: SimpleExp WS_ "-" WS_ Term ;
syntax SimpleExp = term: Term ;
syntax SimpleExp = uminus: "-" WS_ Term ;
syntax SimpleExp = uplus: "+" WS_ Term ;
syntax Term = factor: Factor ;
syntax Term = mul: Term WS_ "*" WS_ Factor ;

syntax Body = exp: "." Exp!letexp e;
syntax Body = lam: Name n Body b;
syntax Exp = app: Exp!letexp e Short s;
syntax Exp = short: Short s;
syntax Name = longid: "\<" LongId lId "\>" ;
syntax Name = shortid: ShortId sId;
syntax Short = lam: "\\" Name n Body b;
syntax Short = paren: "(" Exp!letexp e ")" ;
syntax Short = var: Name n;
wnederhof commented 9 years ago

One thing you may notice is that I use WS_ and WS__ and a separate whitespace layout. This is because I built a generator that translates Banana Algebra into Rascal and I have not removed them yet. However, I experienced the same problem with a totally different grammar with no such tokens.

tvdstorm commented 9 years ago

Tip: remove all whitespace stuff, and extend lang::std::Layout and extend lang::std::Comment.

You should add follow restriction on lexicals:

lexical Id          = [a-zA-Z][0-9a-zA-Z_]* !>> [a-z-A-Z0-9];
lexical ShortId = [a-zA-Z];
lexical LongId  = Id;

Probably it's slow because all your layout is ambiguous.

tvdstorm commented 9 years ago

Oh, and WS_ is not needed.

wnederhof commented 9 years ago

Thank you for your help. I have to admit that I did not spend too much time on the layout yet.

I removed all WS_, WS__ and layouts, restarted Rascal and set the header of the file (right before the actual syntax definitions) to:

extend lang::std::Layout;
extend lang::std::Comment;

lexical Id      = [a-zA-Z][0-9a-zA-Z_]* !>> [a-zA-Z0-9];
lexical ShortId = [a-zA-Z];
lexical LongId  = Id;

The only ambiguity I can see still is that (Exp)a can either be a Short.var or a Prim.var. Unfortunately, it is still not loading.

DavyLandman commented 9 years ago
syntax LetExp = letfun: "let" WS__ Id WS_ "(" WS_ Id WS_ ")" WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp ;
syntax LetExp = letrec: "letrec" WS__ Id WS_ "(" WS_ Id WS_ ")" WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp ;
syntax LetExp = letvar: "let" WS__ Id WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp ;

syntax means that layout is added between all elements in your production. So this might nog be what you want. There is an undocumented flag to provide custom layout for your production, I think you want that.

The tag is @manual

so it would be:

syntax LetExp 
    = @manual letfun: "let" WS__ Id WS_ "(" WS_ Id WS_ ")" WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp
    | @manual letrec: "letrec" WS__ Id WS_ "(" WS_ Id WS_ ")" WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp 
    | @manual LetExp = letvar: "let" WS__ Id WS_ "=" WS_ LetExp WS_ "in" WS_ LetExp
    ;

and:

lexical WS_     = (Wsp | Comment)* !>> [\ \n\r];
lexical WS__        = (Wsp | Comment)+ !>> [\ \n\r];
DavyLandman commented 8 years ago

can this issue be closed?

jurgenvinju commented 8 years ago

Yes, the performance is grammar dependent of course. We can't fix anything here.