usethesource / rascal-language-servers

An LSP server for Rascal which includes an easy-to-use LSP generator for languages implemented in Rascal, and an interactive terminal REPL.
BSD 2-Clause "Simplified" License

Prevent "bootstrapping" a full parser generator for deployed DSL implementations #267

Closed DavyLandman closed 11 months ago

DavyLandman commented 1 year ago

Is your feature request related to a problem? Please describe. Currently a DSL can take up to 30s to load, due to Rascal's overhead of generating parsers for files with concrete syntax. In the meantime, users are looking at a piece of text without syntax highlighting.

Describe the solution you'd like I want to extend the registerLanguage call with a classname. If that classname is defined, we use that for parsing files in that language. This would allow us to provide syntax highlighting as soon as the JVM has loaded the jars. (A developer is responsible for correctly generating a Rascal parser class.)
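
A rough sketch of what that could look like on the Java side, assuming a hypothetical loader in the LSP server; the class and method names here are illustrative only, not the actual rascal-lsp API:

    class PrecompiledParserLoader {
        // Sketch (assumption): given the class name passed to registerLanguage, load the
        // pre-generated parser class from the DSL's jar via reflection instead of
        // bootstrapping the parser generator at runtime.
        static Object loadPrecompiledParser(String parserClassName, ClassLoader dslJarLoader)
                throws ReflectiveOperationException {
            Class<?> parserClass = Class.forName(parserClassName, true, dslJarLoader);
            return parserClass.getDeclaredConstructor().newInstance();
        }
    }

If the class name is absent, or the class cannot be found during development, the server would fall back to the current parser-generator path.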

Describe alternatives you've considered

Additional context This is a deployment concern. During development these annotations should be ignored.

jurgenvinju commented 1 year ago

There is a need to know what contributes significantly to this bottleneck:

Requirements on a solution:

It is highly likely that future parsers will be stored in different formats than .class files. This is (a) because the parsing technology becomes more dynamic, or (b) because class loading becomes the bottleneck in parsing. The fastest Java parser I know (JikesPG) is so fast because it does not use the class loading mechanism to instantiate a parser.

Grammars are not only used for parsing in a lot of Rascal code. Every grammar reification, whether used for parsing or not, will trigger a full load of the parser generator. It is very hard and also brittle to weed out all the # operators from a complete program "only" for the sake of efficiency.

[Thinking-out-loud-begin]

Looking at this from a programming language perspective: the parser is a compile-time constant that is derived from the grammar, which is also a compile-time constant. In concrete syntax expressions, only this constant is used. And if we call the parse function, we also use a constant #start[Program].

So in a way, this is a special case of constant-folding. Step 1:

Step 2:

So that amounts to:

  1. let parsers be cached on disk by ParseTree.parsers based on grammar-hashed file names for class files
  2. (option 4) let grammar values on the user side be cached by the users (by reading them from their own jar files), or (option 2) let the interpreter cache constant grammar reifications.
  3. let concrete syntax grammars be cached on disk by the interpreter's import-and-module-parser functionality

Steps 2 and 3 both lead into step 1. I'd prefer 2 over 4, but I don't know how to index #start[Program] yet such that there are no accidental collisions.
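
A rough sketch of what step 1 could amount to on the Java side (caching generated parser classes on disk under grammar-hashed file names); the helper names here are assumptions for illustration, not the actual implementation:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    class ParserClassCache {
        // Cache the generated parser bytecode under a name derived from a hash of the
        // grammar, so an unchanged grammar never triggers the parser generator again.
        static byte[] cachedOrGeneratedParser(String serializedGrammar, Path cacheDir)
                throws IOException, NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(serializedGrammar.getBytes(StandardCharsets.UTF_8));
            Path classFile = cacheDir.resolve("Parser$" + toHex(digest) + ".class");
            if (Files.exists(classFile)) {
                return Files.readAllBytes(classFile); // cache hit: skip the generator entirely
            }
            byte[] bytecode = generateParserClass(serializedGrammar); // hypothetical stand-in
            Files.createDirectories(cacheDir);
            Files.write(classFile, bytecode);
            return bytecode;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        // Placeholder for the real pipeline: run the parser generator and compile the result.
        private static byte[] generateParserClass(String serializedGrammar) {
            throw new UnsupportedOperationException("stand-in for the actual parser generator");
        }
    }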

jurgenvinju commented 1 year ago

By the way, this is not only a deployment concern: since Rascal programs with reified grammars load really slowly, they are also annoying when making incremental developments in the REPL. Starting a new REPL takes way too much time.

jurgenvinju commented 1 year ago

However, if we find a solution that only works for deployed programs, that would be great.

DavyLandman commented 1 year ago

I think there are many gains to be had in this area of Rascal. Your analysis is right. But it's also a big change you are proposing, and it does not solve the specific issue we run into.

What I want to try and specifically solve is to reduce the time a user is looking at syntax-highlighting-free code. VS Code will trigger Rascal the first time it sees a file of a certain language. And since we've moved the syntax highlighting to the semantic highlighter, it will be without syntax highlighting.

So what is happening before we get syntax highlighting?

  1. start extension
  2. registerLanguage to tell rascal-lsp about where to find the module that contains the contributions
  3. rascal-lsp starts an evaluator
  4. import the contributions module; this module loads a lot of other modules: the one with the grammar, the one with the TypePal-based typechecker (with lots of concrete syntax)
  5. after importing is done (at which time every module with concrete syntax will have had a new grammar generated for it) we call the function that builds the contributions
  6. that function gets a parser for start[..] and returns that. This again generates a parser, now for the first time without concrete syntax and Rascal in the mix, so a new one again.

So before we get to parse the file, we might have had to calculate the grammar, generate the parser, and compile the parser 4 to 5 times (depending on the number of modules with concrete syntax).

I want to find a way to cut that down. I would also prefer a way that's less dependent on the current implementation of Rascal, but then again, the deployed scenarios are using fixed versions of Rascal, so if that interface changes, we have a way to deal with that.

My proposal was backwards compatible by making it a keyword parameter.

DavyLandman commented 1 year ago

With regards to storing the grammar in a file: we would indeed be faster, since we only need to load a dedicated evaluator that runs the parser generator and compiles the Java files. I'll try and run some profiles to see which part we are spending most time in.

jurgenvinju commented 1 year ago

I suspect that in the given start-up scenario, just before syntax highlighting works, we are looking mainly at the loading of the parser generator and the reduction of the #start[] non-terminal to a grammar value. But let's see.

Side note:

It could also be an elegant idea to move forward with the current compiler and compile the parser generator to Java, and then use one (locked) instance between all the LSP instances (or have one for each instance). That way the loading time should be much lower, and we can then focus on making the reification much faster as well (which would be implemented by the same generated Java code library).

DavyLandman commented 1 year ago

Profile of a new registerLanguage just after clearing that existing language (just to fill the caches a bit, which isn't even the case for a fresh start).

Already, for speed purposes, we have the following pattern:

module lang::DSL::Syntax

start syntax Module = ...;

// since Rascal caches the internal parsers, but those caches can expire, we keep around an instance of our parser
// this helps us avoid regenerating the parser if a user stopped interacting with their IDE for a while
private start[Module] (value, loc) cachedParser = parser(#start[Module]);

public start[Module] (value, loc) getParser() = cachedParser;

It does mean that during import we do most of our calculation, and that the actual contributions call is not visible in the trace, as it just returns this cachedParser that was already built during initialization.

So my observations (which align with your suspicions):

If we could do something about the parser generator load times and its performance, that would be nice indeed.

jurgenvinju commented 1 year ago

Excellent info.

DavyLandman commented 1 year ago

I quickly added a timer around the parse in Import::parseModuleAndFragments and ran it with my machine in slow mode (limited to 4 of its slower cores). Now parsing the ParserGenerator module takes 900ms during import.

Parsing: file:///..../src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc took: 905ms

fyi: I only measured the raw parsing time:

      var start = System.nanoTime();
      IActionExecutor<ITree> actions = new NoActionExecutor();

      ITree tree = new RascalParser().parse(Parser.START_MODULE, location.getURI(), data, actions, new DefaultNodeFlattener<IConstructor, ITree, ISourceLocation>(), new UPTRNodeFactory(true));

      var stop = System.nanoTime();
      var time = java.util.concurrent.TimeUnit.NANOSECONDS.toMillis(stop - start);
      if (time > 500) {
        System.err.println("Parsing: " + location.getURI().toString() + " took: " + time + "ms");
      }
jurgenvinju commented 1 year ago

That's a large file, and there are many others like List and Set and Map and grammar::definition::Modules which are similarly large.

Raw parsing time is not something that is easily optimized, as it has already been optimized to the bone (@arnoldlankamp); however, there is a UPTRNodeFactory included that maps the SPPF parse graph to the Tree format we are used to. I'm hoping there is something to get out of that.

Separately the mapping to subclasses of AbstractAST could be expensive.

jurgenvinju commented 1 year ago

This is my trace for import lang::rascal::grammar::ParserGenerator; with printTimes set to true in org.rascalmpl.parser.gtd.SGTDBF, which is the main parser driver:

rascal>import lang::rascal::grammar::ParserGenerator;
Parsing: 57
Unbinarizing, post-parse filtering, and mapping to UPTR: 16
Parsing: 3
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc|
Parsing: 444
Unbinarizing, post-parse filtering, and mapping to UPTR: 82
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/String.rsc|
Parsing: 67
Unbinarizing, post-parse filtering, and mapping to UPTR: 26
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/List.rsc|
Parsing: 95
Unbinarizing, post-parse filtering, and mapping to UPTR: 36
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Map.rsc|
Parsing: 18
Unbinarizing, post-parse filtering, and mapping to UPTR: 10
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/IO.rsc|
Parsing: 64
Unbinarizing, post-parse filtering, and mapping to UPTR: 26
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Exception.rsc|
Parsing: 6
Unbinarizing, post-parse filtering, and mapping to UPTR: 4
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/syntax/Rascal.rsc|
Parsing: 64
Unbinarizing, post-parse filtering, and mapping to UPTR: 30
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/ParseTree.rsc|
Parsing: 49
Unbinarizing, post-parse filtering, and mapping to UPTR: 29
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Message.rsc|
Parsing: 2
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Type.rsc|
Parsing: 204
Unbinarizing, post-parse filtering, and mapping to UPTR: 48
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Productions.rsc|
Parsing: 24
Unbinarizing, post-parse filtering, and mapping to UPTR: 4
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Attributes.rsc|
Parsing: 11
Unbinarizing, post-parse filtering, and mapping to UPTR: 3
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/ValueIO.rsc|
Parsing: 5
Unbinarizing, post-parse filtering, and mapping to UPTR: 2
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Literals.rsc|
Parsing: 25
Unbinarizing, post-parse filtering, and mapping to UPTR: 4
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Grammar.rsc|
Parsing: 13
Unbinarizing, post-parse filtering, and mapping to UPTR: 4
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/util/Maybe.rsc|
Parsing: 2
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Names.rsc|
Parsing: 11
Unbinarizing, post-parse filtering, and mapping to UPTR: 3
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Symbols.rsc|
Parsing: 27
Unbinarizing, post-parse filtering, and mapping to UPTR: 5
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Characters.rsc|
Parsing: 61
Unbinarizing, post-parse filtering, and mapping to UPTR: 8
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Regular.rsc|
Parsing: 21
Unbinarizing, post-parse filtering, and mapping to UPTR: 3
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Modules.rsc|
Parsing: 17
Unbinarizing, post-parse filtering, and mapping to UPTR: 4
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Layout.rsc|
Parsing: 26
Unbinarizing, post-parse filtering, and mapping to UPTR: 15
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Set.rsc|
Parsing: 29
Unbinarizing, post-parse filtering, and mapping to UPTR: 12
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/util/Math.rsc|
Parsing: 27
Unbinarizing, post-parse filtering, and mapping to UPTR: 10
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ConcreteSyntax.rsc|
Parsing: 16
Unbinarizing, post-parse filtering, and mapping to UPTR: 14
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Keywords.rsc|
Parsing: 7
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Node.rsc|
Parsing: 10
Unbinarizing, post-parse filtering, and mapping to UPTR: 3
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Priorities.rsc|
Parsing: 48
Unbinarizing, post-parse filtering, and mapping to UPTR: 12
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/format/Grammar.rsc|
Parsing: 87
Unbinarizing, post-parse filtering, and mapping to UPTR: 12
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/analysis/graphs/Graph.rsc|
Parsing: 31
Unbinarizing, post-parse filtering, and mapping to UPTR: 7
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/Relation.rsc|
Parsing: 48
Unbinarizing, post-parse filtering, and mapping to UPTR: 22
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/format/Escape.rsc|
Parsing: 23
Unbinarizing, post-parse filtering, and mapping to UPTR: 5
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/analysis/grammars/Dependency.rsc|
Parsing: 4
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/References.rsc|
Parsing: 2
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/Lookahead.rsc|
Parsing: 98
Unbinarizing, post-parse filtering, and mapping to UPTR: 13
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/definition/Parameters.rsc|
Parsing: 14
Unbinarizing, post-parse filtering, and mapping to UPTR: 3
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/util/Monitor.rsc|
Parsing: 6
Unbinarizing, post-parse filtering, and mapping to UPTR: 2
ok

With CPU profiling turned on for Rascal itself, you get a good view of the largest modules:

FRAMES PROFILE: 130 data points, 3143 ticks, tick = 1 milliSecs
                                         Scope   Ticks        %  Source
        lang::rascal::grammar::ParserGenerator     833    26.5%  |main://lang::rascal::grammar::ParserGenerator|
                                     ParseTree     336    10.7%  |main://ParseTree|
                                       $shell$     304     9.7%  |main://$shell$|
                                          List     204     6.5%  |main://List|
 lang::rascal::grammar::definition::Priorities     185     5.9%  |main://lang::rascal::grammar::definition::Priorities|
                  lang::rascal::syntax::Rascal     177     5.6%  |main://lang::rascal::syntax::Rascal|
                 lang::rascal::format::Grammar     158     5.0%  |main://lang::rascal::format::Grammar|
                                        String     158     5.0%  |main://String|
lang::rascal::grammar::definition::Productions     128     4.1%  |main://lang::rascal::grammar::definition::Productions|
    lang::rascal::grammar::definition::Modules     109     3.5% 
jurgenvinju commented 1 year ago

But the slowness does not seem to be related much to the size of the file:

$ wc src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc 
     633    2731   28435 src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc
$ wc src/org/rascalmpl/library/ParseTree.rsc 
     790    4058   28576 src/org/rascalmpl/library/ParseTree.rsc
$ wc src/org/rascalmpl/library/List.rsc 
    1138    3553   24912 src/org/rascalmpl/library/List.rsc
jurgenvinju commented 1 year ago

If you parse the same module again, the parsing time drops sharply. Possibly this is the JIT kicking in:

rascal>import lang::rascal::grammar::ParserGenerator;
Parsing: 0
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Parsing: 0
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc|
Parsing: 108
Unbinarizing, post-parse filtering, and mapping to UPTR: 17
jurgenvinju commented 1 year ago

Observations:

jurgenvinju commented 1 year ago

With the JIT turned off:

rascal>import lang::rascal::grammar::ParserGenerator;
Parsing: 101
Unbinarizing, post-parse filtering, and mapping to UPTR: 20
Parsing: 5
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc|
Parsing: 2454
rascal>import lang::rascal::grammar::ParserGenerator;
Parsing: 7
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Parsing: 7
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc|
Parsing: 2217
Unbinarizing, post-parse filtering, and mapping to UPTR: 592
ok
rascal>import lang::rascal::grammar::ParserGenerator;
Parsing: 6
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Parsing: 5
Unbinarizing, post-parse filtering, and mapping to UPTR: 1
Loading module |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc|
Parsing: 2169
Unbinarizing, post-parse filtering, and mapping to UPTR: 578
ok

Indeed, the parser does not run faster the second time or the third time. And it is also slower the first time. So we may attribute the difference between 100ms and 600ms entirely to the JIT compiler.

arnoldlankamp commented 1 year ago

If you want to measure the execution time of some piece of code you can use ManagementFactory.getThreadMXBean().getCurrentThreadCpuTime(). This cuts out most of the JVM related noise (i.e.: GC & JIT). It may be interesting to see what the difference is between CPU time and real time.
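
A minimal sketch of such a measurement, comparing per-thread CPU time with wall-clock time; the workload in the middle is a placeholder:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    class CpuVsWallTime {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            long cpuStart = threads.getCurrentThreadCpuTime(); // nanoseconds of CPU time for this thread
            long wallStart = System.nanoTime();

            // ... run the code under measurement here, e.g. the parse call ...

            long cpuMillis = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
            long wallMillis = (System.nanoTime() - wallStart) / 1_000_000;
            // A large gap between wall time and CPU time points at GC/JIT or other JVM overhead.
            System.err.println("cpu: " + cpuMillis + "ms, wall: " + wallMillis + "ms");
        }
    }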

The parser implementation can be GC heavy for highly ambiguous grammars, since in those cases it produces a large graph in which many branches will remain 'alive' for a long time. So lots of marking/tracing/moving, but very little memory being freed up each cycle. But I doubt this is the main issue in this specific case, since there shouldn't be a large number of long-lived stacks running in parallel while parsing a file using the Rascal grammar.

If it turns out a large portion of the time is spent by the JVM, you could try and do a run with the old CMS GC instead of G1. While G1 is faster for like 98% of the normal use cases, it can sometimes perform absolutely terribly on big and/or cyclic graph-like data structures (like the parser produces). Since G1 segments the heap (to be able to deal with larger heap sizes efficiently), it needs to keep track of cross-segment pointers. Graphs are generally harder to segment, and it can run into trouble with these.

arnoldlankamp commented 1 year ago

As for improving the performance of the parser algorithm/implementation; that would be a challenge, as it's already very highly optimized across all available dimensions.

The only semi-significant performance improvement that comes to mind is to implement something like an 'algorithm stack'. Classify each part of the grammar (as possibly ambiguous, certainly non-ambiguous, regular, or unique match) and parse each segment using its most efficient implementation. However, this will have a significant cost in terms of implementation complexity and will add a non-trivial analysis burden to the parser generator. Figuring out how to efficiently do that last part alone could probably net you a doctoral degree πŸ˜„ .

Other than that πŸ€·β€β™‚οΈ . Migrate the implementation to C or Rust πŸ˜› ?

arnoldlankamp commented 1 year ago

But in terms of this specific issue. Why not pre-parse the standard library for Rascal, save the ASTs and ship them with the distribution? Just use the name + hash of the rsc file when saving the ASTs, so the cached version won't be used in case you are modifying the Rascal source.

The process would, for example, look like:

  • Requests parse of ParserGenerator.rsc
  • Calculate hash for ParserGenerator.rsc (e.g.: d5gq4w4g4r7jujsg43y6)
  • Look for ParserGenerator-d5gq4w4g4r7jujsg43y6-cached.ast

    • Found -> Use cached AST
    • Not found -> Parse ParserGenerator.rsc
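
A small sketch of what that hash-based lookup could look like; the cache directory layout and file naming are assumptions taken from the example above:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    class CachedAstLookup {
        // Compute the cache file name for a source module: <name>-<hash>-cached.ast.
        static Path cachedAstFor(Path rscFile, Path cacheDir) throws IOException, NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(rscFile));
            StringBuilder hash = new StringBuilder();
            for (byte b : digest) {
                hash.append(String.format("%02x", b));
            }
            String base = rscFile.getFileName().toString().replace(".rsc", "");
            return cacheDir.resolve(base + "-" + hash + "-cached.ast");
        }
    }

If the file exists, the stored tree is read back; if not, the module is parsed and the result written under that name, so edits to the Rascal source automatically invalidate the cache.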

DavyLandman commented 1 year ago

I did some profiling of the parser (just a loop that went through all rascal modules).

(profiler screenshot)

Almost no GC.

Parsing phases split: (screenshot)

Here is reduce: (screenshot)

Here is expand: (screenshot)

And the node flattener is one heck of a recursive party: (screenshot)

The hotspots are these: (screenshot)

So this roughly aligns with the conclusion: to make the parser faster we'll have to tweak the algorithm. Perhaps we could take a look at the EdgeSet, the IntegerMap, and the IntegerKeyedDoubleValueHashMap.

arnoldlankamp commented 1 year ago

When looking at the traces, nothing immediately stands out to me. It looks as I would expect; a pretty even split between reduce and expand, with most of the work being done in updateNextNode.

buildResult also looks pretty decent, since it doesn't seem to have to deal with ambiguities.

The recursion in the flattener is not that surprising, since it needs to traverse all the way down the binarized parse forest to be able to flatten it. The flattener actually contains a conditional tail recursion optimization to keep the stack depth in check when handling lists (otherwise you'd run out of stack size fast). Doing it entirely without recursion would be even more complicated than it already is, because of all the code needed for keeping track of ambiguities and cycles in the parse forest. You're basically dealing with forking and recombining stacks; without using recursive functions you'd need to keep track of even more state. What would be possible is to remove the layer of abstraction from the flattener by removing all the factories and just hardwiring the concrete implementation. There should be an older version of the flattener like this in the version control history (likely from before the Git migration), which might still be fully functional as is, unless any changes were made in the flattener code at a later date. You could shave ~10% or more off the buildResult time, at the cost of losing the ability to easily swap to a different AST implementation or to support multiple different AST implementations simultaneously. However, looking at the above trace, this would only net you a 1-2% decrease in overall execution time, which may not be worth it.

As for the custom data structures: these are basically specialized implementations aimed at reducing memory usage and improving performance. Maybe you could play around with the initial capacity a bit to see if that has any noticeable impact. As for the implementations themselves; they were written in the JDK 1.6 era with the JIT's capabilities at that time in mind. Maybe some things could be written slightly differently today, but that would require some looking into, and the gains likely won't be huge. These data structures are heavily used, which is why a significant amount of time is spent in their methods in the trace.

As for the GC, looking at that graph, it doesn't seem to be an issue. While G1 can offload work to worker threads when worst-case behavior pops up (which may not be properly reflected in the GC stats), it doesn't seem to be what's happening here.

Can we maybe narrow down this issue by checking whether there are specific pieces of syntax or certain grammars it struggles with more than others?

Also, maybe looking at the parser's trace logging can give us some clue as to what it's doing specifically. Maybe we can spot something out of the ordinary in there. So we can have a more targeted look at the implementation. It should print expands, reduces and propagations (propagations are a right nullable fix thing) for specific input locations/ranges, which should be relatively easily relatable to locations in a source file. This could also potentially point us to improvements that could be made elsewhere as well, if such a possibility exists. As an example: it could indicate the parser is dealing with local ambiguities in a grammar which are resolved at a fairly late point in the parse, causing it to drag along certain stacks for longer than may be necessary. This may be resolved by modifying a rule in a grammar.

jurgenvinju commented 1 year ago

But in terms of this specific issue. Why not pre-parse the standard library for Rascal, save the ASTs and ship them with the distribution? Just use the name + hash of the rsc file when saving the ASTs, so the cached version won't be used in case you are modifying the Rascal source.

The process would, for example, look like:

  • Requests parse of ParserGenerator.rsc
  • Calculate hash for ParserGenerator.rsc (e.g.: d5gq4w4g4r7jujsg43y6)
  • Look for ParserGenerator-d5gq4w4g4r7jujsg43y6-cached.ast

    • Found -> Use cached AST
    • Not found -> Parse ParserGenerator.rsc

According to my analysis that would only help a bit. The parser generator code is loaded for the grammar reification operator anyway.

jurgenvinju commented 1 year ago

It may be the case that 50% of the time in parsing string templates is spent in temporary, soon-to-be-dead stacks. I'll have a look in the grammar. The big module has one very long string template. Nevertheless, it remains the module that trains the JIT, so it will always be the slower module with the current setup.

DavyLandman commented 1 year ago

As for the custom data structures: these are basically specialized implementations aimed at reducing memory usage and improving performance. Maybe you could play around with the initial capacity a bit to see if that has any noticeable impact. As for the implementations themselves; they were written in the JDK 1.6 era with the JIT's capabilities at that time in mind. Maybe some things could be written slightly differently today, but that would require some looking into, and the gains likely won't be huge. These data structures are heavily used, which is why a significant amount of time is spent in their methods in the trace.

I think it all looks good to me. I was wondering (since the profile shows the .get on the IntegerMap and IntegerKeyedDoubleValueHashMap) about your trade-offs during the design of the int-keyed maps. They appear to use a linked list for the hash-collision buckets (aka separate chaining); if we have lots of hash collisions, could that explain the relative amount of time spent in .get? Whereas if we used open addressing we might have more locality while probing past a collision. I've looked at a few other int-keyed hashmaps, and they all seem to use a separate array for the int keys, do some form of open addressing on that, and then use the resulting index to dereference the separate value array. Do you remember your experiments at the time? Or would you like to spend some time right now on this? ;)

arnoldlankamp commented 1 year ago

Sure, you could have a look at an open addressing implementation for the hashmaps. It may well work well in this case, since the keys are generally integers that are associated with an input location, so they should distribute pretty uniformly if you use a power of two for the table size and hash them by simply taking the remainder of division by the table's size.
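
A minimal sketch of an int-keyed map along those lines (open addressing with linear probing and a power-of-two table); this is an illustration, not the actual IntegerMap or IntegerKeyedDoubleValueHashMap code:

    final class IntKeyedMap<V> {
        private int[] keys;
        private Object[] values; // values[i] == null marks an empty slot
        private int size;

        IntKeyedMap() {
            keys = new int[16];
            values = new Object[16];
        }

        @SuppressWarnings("unchecked")
        V get(int key) {
            int mask = keys.length - 1;
            int i = key & mask; // power-of-two table: the index is just the low bits of the key
            // Linear probing: scan forward until we hit the key or an empty slot.
            while (values[i] != null) {
                if (keys[i] == key) {
                    return (V) values[i];
                }
                i = (i + 1) & mask;
            }
            return null;
        }

        void put(int key, V value) {
            if ((size + 1) * 2 > keys.length) {
                resize(); // keep the load factor at or below 0.5 to limit clustering
            }
            int mask = keys.length - 1;
            int i = key & mask;
            while (values[i] != null && keys[i] != key) {
                i = (i + 1) & mask;
            }
            if (values[i] == null) {
                size++;
            }
            keys[i] = key;
            values[i] = value;
        }

        @SuppressWarnings("unchecked")
        private void resize() {
            int[] oldKeys = keys;
            Object[] oldValues = values;
            keys = new int[oldKeys.length * 2];
            values = new Object[oldValues.length * 2];
            size = 0;
            for (int j = 0; j < oldKeys.length; j++) {
                if (oldValues[j] != null) {
                    put(oldKeys[j], (V) oldValues[j]);
                }
            }
        }
    }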

arnoldlankamp commented 1 year ago

Related to usethesource/rascal#1813.

It may be interesting to find out why the enlarge method of the EdgeSet has such a large impact on the total run time. For some reason the add function seems to be called a lot. Calling add on the EdgeSet is an indication of a parse stack merge. And stacks can only merge if they first split, so why is this happening so often that it takes up a significant amount of the execution time πŸ€” ?

jurgenvinju commented 1 year ago

But in terms of this specific issue. Why not pre-parse the standard library for Rascal, save the ASTs and ship them with the distribution? Just use the name + hash of the rsc file when saving the ASTs, so the cached version won't be used in case you are modifying the Rascal source.

The process would, for example, look like:

  • Requests parse of ParserGenerator.rsc
  • Calculate hash for ParserGenerator.rsc (e.g.: d5gq4w4g4r7jujsg43y6)
  • Look for ParserGenerator-d5gq4w4g4r7jujsg43y6-cached.ast

    • Found -> Use cached AST
    • Not found -> Parse ParserGenerator.rsc

This is also worth investigating.

jurgenvinju commented 1 year ago

I have to figure out how long it takes to read in the parse trees from disk. Generally, since they are about four times as big as the input sentence, you get little out of it. But at least the process is deterministic.

jurgenvinju commented 1 year ago

Related to usethesource/rascal#1813.

It may be interesting to find out why the enlarge method of the EdgeSet has such a large impact on the total run time. For some reason the add function seems to be called a lot. Calling add on the EdgeSet is an indication of a parse stack merge. And stacks can only merge if they first split, so why is this happening so often that it takes up a significant amount of the execution time πŸ€” ?

My guess is that the string template grammar could use a little more lookahead; perhaps it is constantly accepting empty lists or reducing early only to find out the " hasn't been reached yet. Early list reduction would lead to such merges I guess.

jurgenvinju commented 1 year ago

Related to usethesource/rascal#1813.

It may be interesting to find out why the enlarge method of the EdgeSet has such a large impact on the total run time. For some reason the add function seems to be called a lot. Calling add on the EdgeSet is an indication of a parse stack merge. And stacks can only merge if they first split, so why is this happening so often that it takes up a significant amount of the execution time πŸ€” ?

Another cause could be a cyclic grammar. If that happens in whitespace we'd never see it when interpreting the tree, but we'd pay for it nevertheless.

arnoldlankamp commented 1 year ago

I have to figure out how long it takes to read in the parse trees from disk. Generally, since they are about four times as big as the input sentence, you get little out of it. But at least the process is deterministic.

Yes, the parse trees are quite a bit bigger, but reading their binary format from disk is a process that should scale fairly linearly with file size and should be quicker than parsing the source file, since parsing is a lot more compute heavy.

How much the difference will be I can only guess, but writing a benchmark for reading in a parse tree from disk should be a relatively easy thing to do (maybe you can even time it in the REPL or something?). If the difference is large enough, implementing a caching solution like this could resolve the issue without too much time investment.

arnoldlankamp commented 1 year ago

Related to usethesource/rascal#1813. It may be interesting to find out why the enlarge method of the EdgeSet has such a large impact on the total run time. For some reason the add function seems to be called a lot. Calling add on the EdgeSet is an indication of a parse stack merge. And stacks can only merge if they first split, so why is this happening so often that it takes up a significant amount of the execution time πŸ€” ?

My guess is that the string template grammar could use a little more lookahead; perhaps it is constantly accepting empty lists or reducing early only to find out the " hasn't been reached yet. Early list reduction would lead to such merges I guess.

I'm not familiar with what the string template grammar looks like, but to give some background: calling add on the EdgeSet happens in the following two cases. Consider the grammar: A := BC

Passing a debug listener to the parser might give us some information about what's going on. It will probably show more move or expand calls than expected for possible culprits.

arnoldlankamp commented 1 year ago

Having said that, it may also well be that it turns out nothing out of the ordinary is happening when we look at the logging, but if large EdgeSets are being created, as the improvement the Arrays.copyOf change accomplished suggests, something unexpected may be happening and it at least warrants a look πŸ€” .

DavyLandman commented 1 year ago

Sure, you could have a look at on open addressing implementation for the hashmaps. It may well work well in this case, since the keys are generally integers that are associated with an input location, so they should distribute pretty uniformly if you use a power of two for the table size and hash them by simply dividing by the table's size.

I just ran some stats on all get of the IntegerKeyedDoubleValueHashMap. Here is how often we had to walk the chain (so a measure of how deep the hash collision nodes were).

  count   chain depth
 834900   0
 208293   1
 109675   2
  51356   3

So in 369324 of the total 1204224 gets we ended up in a node with 1 or more hash collisions. So about 30% of the gets most likely incur multiple cache misses.

I'm glad to see they aren't that deep, but it might still pay off to see if some open addressing could make those accesses a bit more local.
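
For reference, instrumentation along these lines could produce such a histogram; this is a hypothetical sketch, not the actual map code:

    import java.util.Map;
    import java.util.TreeMap;

    class ChainDepthStats {
        // Histogram of how many collision-chain nodes each get() had to walk past.
        private static final Map<Integer, Long> HISTOGRAM = new TreeMap<>();

        // Called from inside get() with the number of chain links traversed.
        static void record(int depth) {
            HISTOGRAM.merge(depth, 1L, Long::sum);
        }

        static void print() {
            HISTOGRAM.forEach((depth, count) ->
                System.err.printf("%8d gets walked %d collision node(s)%n", count, depth));
        }
    }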

jurgenvinju commented 1 year ago

If I run ParserGenerator through the normal parse path (the generated parsers in ParseTree), then I get consistently fast behavior:

rascal>() { parse(#start[Module], |file:///Users/jurgenv/git/rascal/src/org/rascalmpl/library/lang/rascal/grammar/ParserGenerator.rsc|); }()
Parsing: 0
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Parsing: 0
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Parsing: 126
Unbinarizing, post-parse filtering, and mapping to UPTR: 18
ok

This is explainable, since the parser superclass has already been optimized by the JIT: it had to parse ParserGenerator earlier already in order to generate the current parser.

jurgenvinju commented 1 year ago

Reading the parse tree from disk:

rascal>cpuTimeOf(() { readBinaryValueFile(#start[Module], |home:///test.bin|); }) 
Parsing: 0
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
Parsing: 0
Unbinarizing, post-parse filtering, and mapping to UPTR: 0
int: 78985000

So that's 80ms, compared to 126ms (which is also often lower). We will not win the battle by caching the parse trees on disk, even though it is about 40% faster. It remains something to consider, but for now it seems that making sure the parser generator is not loaded at all is still the most viable option for solving this issue.

DavyLandman commented 1 year ago

So in 369324 of the total 1204224 gets we ended up in a node with 1 or more hash collisions. So about 30% of the gets most likely incur multiple cache misses.

I'm glad to see they aren't that deep, but it might still pay off to see if some open addressing could make those accesses a bit more local.

I've butchered the code to use linear probing, and also tried different existing IntObjectMaps; they all reach the same performance numbers. So I think only a new dedicated data structure that tries its best to help the JIT with range checks etc. could maybe squeeze out a bit more here. But the current code, even with the 30% of hash collisions, is fast and will take some effort to improve upon.

jurgenvinju commented 1 year ago

This would not directly expose to the Rascal level what kind of encoding we use for the parser, and it could be used to avoid the use of the # operator completely at run-time. The code itself would need to switch between the two modes somehow.

arnoldlankamp commented 1 year ago

So in 369324 of the total 1204224 gets we ended up in a node with 1 or more hash collisions. So about 30% of the gets most likely incur multiple cache misses. I'm glad to see they aren't that deep, but it might still pay off to see if some open addressing could make those accesses a bit more local.

I've butchered the code to use linear probing, and also tried different existing IntObjectMaps; they all reach the same performance numbers. So I think only a new dedicated data structure that tries its best to help the JIT with range checks etc. could maybe squeeze out a bit more here. But the current code, even with the 30% of hash collisions, is fast and will take some effort to improve upon.

Well, it was worth a try. In theory open addressing could have been a good fit for this specific use-case.

In the past I've also found that open addressing may not always perform better than a chaining implementation. The reasons I found back then were the following:

DavyLandman commented 1 year ago

Yup, agreed, that was roughly the path I took: decrease the load factor to have less clustering.

jurgenvinju commented 1 year ago

https://github.com/usethesource/rascal/pull/1815 adds the fundamental capability of saving and loading "parser functions" on disk.

DavyLandman commented 1 year ago

Nice!

Do you have an idea of how we could trigger these caches (the ones in the evaluator for concrete syntax)? Or is that still WIP?

jurgenvinju commented 1 year ago

WIP.

DavyLandman commented 1 year ago

if it's automatic, that would be nice, then it's just an aspect of packaging :)

arnoldlankamp commented 1 year ago
  • and for storing I was thinking a small mvn plugin that would store those files using a secret interpreter mode.

You may be able to use the Exec Maven Plugin for this. This plugin allows you to execute a Java program in a specific build phase, like generate-sources (if you'd like to have the cache available during development) or package (if you don't want the build-time overhead during development, but only ship the generated cache in the distribution).

jurgenvinju commented 1 year ago

https://github.com/usethesource/rascal/pull/1815 is ready for review if anyone is interested. It should go a long way in preventing the bootstrapping of a parser generator, if applied to both the parsing of concrete syntax fragments and the parsing of DSL files.

In a DSL parser, the parser contributions would not use p = parser(#start[Program]) but rather p = loadParser(|lib:///myProject/myCachedParserLocation|) and the call to p would use a manually constructed reified type to avoid loading the type reifier (which is part of the parser generator): p(type(\start(sort("Program")), ()), input, src)

In the concrete syntax fragment parser we'd do a similar trick. That's for another PR.

The docs of ParseTree.rsc contain an example usage of storeParsers and loadParsers. The lang::rascal::tests::concrete::Parsing module contains a single test.

jurgenvinju commented 1 year ago

if it's automatic, that would be nice, then it's just an aspect of packaging :)

The overhead of packaging all the class files from the in-memory file system into one binary file is significant. So I can't make it automatic inside the interpreter. It could be a default side-effect of the type checker though, to write a parser file next to each .tpl file.