lalrpop / lalrpop

LR(1) parser generator for Rust
http://lalrpop.github.io/lalrpop/
Apache License 2.0

larger grammars generate a LOT of code #65

Closed timglabisch closed 6 years ago

timglabisch commented 8 years ago

Hello,

I'm trying to parse PHP expressions using LALRPOP: https://github.com/timglabisch/rustphp/blob/c79060d6495a55174fc2ad5710d5774d8ec94d67/src/calculator1.lalrpop

My problem is that cargo run becomes incredibly slow:

time cargo run
1234.76s user 19.88s system 99% cpu 20:58.78 total

The generated file is huge (~50 MB):

cat src/calculator1.rs | wc -l
1044621

Is there something fundamentally wrong, or is this expected?

nikomatsakis commented 8 years ago

To be clear, is your concern compilation time (i.e., cargo build) or execution time (i.e., time to parse a PHP file)? I've definitely noticed that the recursive ascent algorithm that LALRPOP uses generates quite a lot of code, which correspondingly takes quite a lot of time to compile. You can help things by switching to the LALR(1) algorithm:

grammar["LALR(1)"];

I have been wanting to implement some other improvements as well (e.g., generalized left corner parsing or other algorithms beyond LALR(1)). Another option is to add a second table-driven mode, but that will cause the parser itself to run more slowly.

timglabisch commented 8 years ago

Sorry for not describing this clearly; my concern is just compilation time.

nikomatsakis commented 8 years ago

@timglabisch ok, then yes it's known -- I'm hoping to put in some effort to fix it in the near term, since I want to get a LALRPOP version of the Rust grammar up and going too (and it's going to have the same problem).

nikomatsakis commented 8 years ago

I modified the issue title to be more precise.

sharpjs commented 8 years ago

This was a sore point for me as well when I tried LALRPOP. Even my small grammar (C-ish expressions) produced a very large .rs file, mostly comments, that took over a minute to compile, IIRC. At the time, it seemed to me that the output size was increasing exponentially with respect to the grammar size, so I discontinued use of LALRPOP for the time being.

I'd like to see this addressed if possible, because I'm one of those crazy people who feel more comfortable with LALR(1) than LL(1). Also, I like LALRPOP's grammar syntax and features.

nikomatsakis commented 8 years ago

I think probably the easiest fix to start is to make a table-driven parser mode, which ought to be much more compressed. I'll probably try to hack that up in the near future. Should be relatively easy.

I'd prefer to do some of the fancier work to make recursive ascent more compact, but that'll be harder. :)

slde-flash commented 8 years ago

From my point of view it would help A LOT if there were some kind of "fast" mode for development and a slower one (maybe up to hours ...) for the real compile.

fhahn commented 8 years ago

@nikomatsakis It seems like lalrpop::lr1::interpret basically provides a table-driven parser mode, right? At the moment it's just used for testing and just generates a ParseTree, but shouldn't it be fairly straight forward to integrate calling the action functions on reduce? If that's the case, wouldn't we be able to emit a table driven parser by dumping the table and some additional code that invokes the modified interpreter?

I'm probably missing something, but if the general idea makes sense I could look into that in the following days.

nikomatsakis commented 8 years ago

@fhahn this is roughly correct; the main catch is that you would have to write something that generates the code to run the interpreter, since the interpreter runs at build.rs time. Something like emit_recursive_ascent. I'd love some help with this; if you have the time, I'm happy to answer any questions. It should be relatively straightforward, I think; the main thing is that there will probably be some refactoring to do, since this'll be the first time that we write a second backend (and we have to decide how to configure it).


fhahn commented 8 years ago

Great, I'll look into it over the next couple of days and will ping you when I've made some progress (or gotten terribly stuck).

TyOverby commented 8 years ago

Has there been any progress on this? I'm working on creating a parser for a scripting language, and progress has completely stopped because generating and compiling syntax.rs takes roughly 6 minutes.

I have a 207-line syntax.lalrpop file that becomes a 931,410-line syntax.rs. Granted, much of that is comments, but rustc can't seem to handle it.

nikomatsakis commented 8 years ago

As far as I know there has been no progress -- @fhahn, any news? I've been working hard on error reporting, but that's coming to a close, so after that I'll turn to this as next top priority (unless I hear otherwise from @fhahn).

nikomatsakis commented 8 years ago

@TyOverby two questions:

  1. Have you attempted to use the LALR(1) mode? To use LALR(1) mode, change the grammar declaration to grammar["LALR(1)"];. This may help because it reduces the number of states, though it can cause spurious conflicts (at some point I want to improve the LR(1) generator to use the LALR(1) state set so long as no conflicts would be reported, but doing that right is a more complex endeavor).
  2. I've also recently noticed that if you do cargo build, then cargo builds and executes your build.rs file in debug mode -- this is a big problem for LALRPOP. cargo build --release, in contrast, will run LALRPOP in release mode, which is WAY faster. So you might try doing a release build and see what happens!
fhahn commented 8 years ago

@nikomatsakis I started looking into it but did not have too much time over the last week. I think we should decide how to emit the parse table before continuing. As far as I can tell we could either emit a serialized version which is deserialized when the parser is started, or emit code which builds up the table explicitly. What would you prefer? I could probably work on this next weekend.

TyOverby commented 8 years ago

@nikomatsakis

I am already using LALR(1). I had it on LR(1) until a build took half an hour to complete.

Thanks for the cargo build --release advice. Here are timings for various scenarios:

cargo build

  After touching syntax.lalrpop: 7m17s
  After touching lib.rs: 47s

cargo build --release

  After touching syntax.lalrpop: 2m18s
  After touching lib.rs: 1m45s

Relative timings

  Debug:   lalrpop 6m30s, rustc 47s
  Release: lalrpop 33s, rustc 1m45s

I'm a huge fan of debug builds, so clearly we need a way to tell cargo to use release versions of build.rs scripts during a debug build.
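(Assuming a Cargo version new enough to support profile build overrides, which landed well after this thread, the idea can be sketched in Cargo.toml; the opt-level choice is an assumption:)

```toml
# Optimize build scripts and their dependencies (including LALRPOP)
# even when the rest of the build is a debug build.
[profile.dev.build-override]
opt-level = 3
```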

In the meantime, I could just use the command-line version of lalrpop compiled in release mode.

47s is much nicer, but it would be great if codegen produced something that rustc could compile faster.

TyOverby commented 8 years ago

@fhahn If you want to serialize and deserialize large structures efficiently, check out bincode. With #[derive(RustcEncodable, RustcDecodable)] you get a structure that can be serialized and deserialized very quickly and with zero size overhead.
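(bincode itself isn't shown here; as a stdlib-only stand-in for the same idea, a flat, zero-overhead encoding of a parse table, here is a sketch that round-trips an i32 action table through little-endian bytes:)

```rust
// Serialize a flat action table (i32 entries) to little-endian bytes
// and back; a stdlib-only stand-in for what bincode automates.
fn encode(table: &[i32]) -> Vec<u8> {
    table.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn decode(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let table = vec![1, -2, 0, 42];
    // Round trip: the decoded table equals the original.
    assert_eq!(decode(&encode(&table)), table);
    println!("round-tripped {} entries", table.len());
}
```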

nikomatsakis commented 8 years ago

@fhahn I would expect to generate a large constant array, but perhaps that too would have large compile times. It'd have the advantage of being faster to parse (no startup deserialization cost). I guess the real answer is that we ought to experiment! In any case, how we generate the table seems like something we can tweak. I figure you'll have two twin stacks -- one for the states, one with a Data enum that wraps up all the possible nonterminals and tokens -- and we'll just have to generate glue code for invoking actions that does a match to downcast appropriately.
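(The twin-stack layout described above can be sketched with a hand-built table for a toy grammar, E -> E "+" n | n. The state numbers and table entries here are made up for illustration, not LALRPOP output:)

```rust
// Toy table-driven LR parser for: E -> E "+" n | n.
// Two twin stacks: one of state numbers, one of `Data` values
// wrapping every terminal/nonterminal payload.

#[derive(Clone, Copy)]
enum Tok { Num(i32), Plus, Eof }

// The "Data" enum: one variant per symbol that carries a value.
enum Data { Num(i32), Plus, E(i32) }

enum Action { Shift(usize), Reduce(u8), Accept, Error }

// Hand-built action table (normally emitted by the generator).
fn action(state: usize, tok: Tok) -> Action {
    match (state, tok) {
        (0, Tok::Num(_)) => Action::Shift(1),
        (1, _) => Action::Reduce(2),          // E -> n
        (2, Tok::Plus) => Action::Shift(3),
        (2, Tok::Eof) => Action::Accept,
        (3, Tok::Num(_)) => Action::Shift(4),
        (4, _) => Action::Reduce(1),          // E -> E "+" n
        _ => Action::Error,
    }
}

// Goto table for the single nonterminal E.
fn goto_e(state: usize) -> usize {
    match state { 0 => 2, _ => unreachable!() }
}

fn parse(input: &[Tok]) -> Option<i32> {
    let mut states = vec![0usize];
    let mut data: Vec<Data> = Vec::new();
    let mut pos = 0;
    loop {
        let tok = input[pos];
        match action(*states.last()?, tok) {
            Action::Shift(s) => {
                data.push(match tok {
                    Tok::Num(n) => Data::Num(n),
                    Tok::Plus => Data::Plus,
                    Tok::Eof => return None,
                });
                states.push(s);
                pos += 1;
            }
            // Glue code: pop the right-hand side, "downcast" via match,
            // run the action, then push the result and the goto state.
            Action::Reduce(1) => {
                let n = match data.pop()? { Data::Num(n) => n, _ => return None };
                data.pop()?; // the "+"
                let e = match data.pop()? { Data::E(e) => e, _ => return None };
                states.truncate(states.len() - 3);
                let next = goto_e(*states.last()?);
                states.push(next);
                data.push(Data::E(e + n));
            }
            Action::Reduce(_) => {
                let n = match data.pop()? { Data::Num(n) => n, _ => return None };
                states.pop();
                let next = goto_e(*states.last()?);
                states.push(next);
                data.push(Data::E(n));
            }
            Action::Accept => {
                return match data.pop()? { Data::E(e) => Some(e), _ => None };
            }
            Action::Error => return None,
        }
    }
}

fn main() {
    let input = [Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Plus, Tok::Num(3), Tok::Eof];
    assert_eq!(parse(&input), Some(6));
    println!("parsed: {:?}", parse(&input));
}
```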

fhahn commented 8 years ago

That sounds good. Some parts probably cannot be created as consts (like maps mapping nonterminals to states), but we should be able to generate them at the beginning of main.

I think as a first step I should create data structures for states and data that can be stored in constant arrays, then continue with code that emits those constant arrays. Then we can adapt the interpreter code to work with the new table layout.

joris-r commented 8 years ago

I also ran into this problem with my grammar, especially when I was using the integrated LALRPOP lexer. At that time the generated parser source file was about 17 MB. Now that I have switched to a custom lexer, it's about 0.5 MB (so much better).

So it seems to me that the way the lexer is generated adds a lot of weight.

nikomatsakis commented 8 years ago

@joris-r interesting, I hadn't really looked at the code generated for the lexer (all of my big examples wind up with a custom lexer anyhow).

LukasKalbertodt commented 8 years ago

I'm a huge fan of debug builds, so clearly we need a way to tell cargo to use release versions of build.rs scripts during a debug build.

Is there any progress on that? I can't find an issue in the cargo repo. Maybe we should request that feature?

nikomatsakis commented 8 years ago

We probably ought to request the feature. I had planned a workaround where you can cargo install lalrpop and it will use the executable from your path, but that is really suboptimal.

TyOverby commented 8 years ago

I thought I already made one, but I guess I didn't. I'll open up an issue now.

TyOverby commented 8 years ago

https://github.com/rust-lang/cargo/issues/2424

pothos commented 8 years ago

For the meantime you can put this in your build.rs:

extern crate lalrpop;
use std::process::Command;

fn main() {
    // Prefer a (release-built) `lalrpop` executable from PATH, if any...
    let s = Command::new("lalrpop").arg("src/grammar.lalrpop").status();
    // ...and fall back to the library invocation if that fails.
    if s.is_err() || !s.unwrap().success() {
        lalrpop::process_root().unwrap();
    }
}

edit: add fallback

TyOverby commented 8 years ago

@pothos Or better yet: have it use the command line tool and fall back to using the typical library approach.

osa1 commented 8 years ago

I was very excited when I discovered LALRPOP, mostly because of the awesome error messages. Unfortunately, even with a 148-line grammar it's basically unusable because of the crazy compile times. Is there any ongoing work to improve this?

osa1 commented 8 years ago

The weird thing is, even if I don't change anything in my lalrpop file, recompilation takes a huge amount of time. Any ideas about this? I'm done with parsing, but unless I make a separate crate just for the parser, the whole project gets unbearably slow.

nixpulvis commented 8 years ago

I find the crate for the parser technique to be pretty nice actually. I have my project split effectively into syntax, and semantics. But yea, faster compiles would be nice either way.

DemiMarie commented 8 years ago

You can use sed to remove the comments from the generated file (assuming that there are no nested /* comments) or have build.rs do this.
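(A rough sketch of that sed pass; the path is hypothetical, and this naive pattern only strips whole-line // comments, leaving block comments and trailing comments alone:)

```shell
# Strip whole-line "//" comments from the generated parser before
# rustc sees it. Path is hypothetical; adjust to your build layout.
sed -i.bak -e 's|^[[:space:]]*//.*$||' src/grammar.rs
```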

nikomatsakis commented 8 years ago

https://github.com/nikomatsakis/lalrpop/pull/122 now offers a table-driven code generation style which I hope will help address these issues. I haven't had time to do much in the way of comparisons though. If someone is able to give it a spin and report back, I'd be interested to hear about your experiences! :)

nikomatsakis commented 8 years ago

Just as a data point, measuring only the final call to rustc, the Pascal demo in LALRPOP takes the following timings:

rustc timings   Debug   Release
Table-driven    10s     17s
Ascent          67s     182s

Note that we are still generating the lexer internally; this could probably be converted to a table-driven style as well for further gains.

nikomatsakis commented 8 years ago

#122 is merged now, so it should be easy to give it a spin. Note that there are some breaking changes (documented in the release notes). In particular, to request a LALR(1) parser, you should now write #[LALR] grammar; and not grammar["LALR(1)"];.

TyOverby commented 8 years ago

Could we get a cargo release with #122?

nikomatsakis commented 8 years ago

@TyOverby yeah, I had just hoped to get some feedback that it works for someone besides me. =)

nikomatsakis commented 8 years ago

Just FYI, some measurements with pascal.rs suggest recursive ascent is about 30% faster. Of course it will depend on a lot of factors.

pothos commented 8 years ago

Somehow I can't compile anymore now. It says

src/grammar.lalrpop:9:1: 9:6 error: unexpected token: `extern`
  extern {

referring to my own Tok data structure, which I am using:

use tok::Tok;

extern {
    enum Tok {
        "(" => Tok::LPAR{location: <String>, value: <String>},
…

edit: solved; the grammar; declaration was missing

nikomatsakis commented 8 years ago

Link to the source? (Actually, could you open a separate issue?)

osa1 commented 8 years ago

Thanks for the patch @nikomatsakis . I'm going to try it sometime today and report back.

pothos commented 8 years ago

I still have to stick to

#[LALR]
#[recursive_ascent] grammar;

which yields a 19 MB file. Otherwise I get an OOM in rustc (or something worse), because grammar; results in a 190 MB file, and just #[LALR] grammar; results in a 57 MB file that later throws

this function takes 1 parameter but 2 parameters were supplied [E0061]
src/grammar.rs:368736  Tok::STRING { location: __tok0, value: __tok1 } => __Symbol::TermSTRING(__tok0, __tok1),

nikomatsakis commented 8 years ago

New version published. @pothos, I'm not sure what's up with the OOM -- I guess rustc isn't coping well with a big array. The 190MB file is probably largely comments, would be my guess, but I'm not sure. I can maybe try to investigate.

DemiMarie commented 8 years ago

Should LALRPOP automatically emit comments? I think it should not unless told to (for debugging LALRPOP).

osa1 commented 8 years ago

I finally tried this. With grammar; I get a 51,383-line Rust file; with #[LALR] grammar; I get a 40,508-line one. Either way it's huge, IMHO. Compile times still suffer.

nikomatsakis commented 7 years ago

Now that we're using the new lexer, we should be generating a lot less code as well. I'd be curious to see some new measurements, though I probably need to do a new public release first.

joris-r commented 7 years ago

Well, I'm not sure exactly how the new lexer is better, but I can say it's much better than one year ago! At that time (LALRPOP 0.10.0, I think) I had something like 0.5 MB with LALR and a custom lexer; now I get 0.18 MB for the generated file using the generated lexer. And with LR, I had 4.1 MB and now 0.64 MB, so that's quite a bit better too.

nikomatsakis commented 7 years ago

@joris-r great :)

nikomatsakis commented 6 years ago

Going to close this; I think we've made a lot of progress here.

tanin47 commented 5 years ago

I'm super late to the party. But I want to add what I've found:

Avoid using pub frivolously, because each pub nonterminal gets its own parser. I reduced the number of pub rules from 20 to 2, and the number of generated lines went from ~2M to ~20K.
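(For illustration, a minimal sketch of that layout with hypothetical rule names: only the entry nonterminal is marked pub, so LALRPOP emits a single ProgramParser rather than one parser per rule:)

```lalrpop
use std::str::FromStr;

grammar;

// One `pub` rule => one generated parser (ProgramParser).
pub Program: Vec<i32> = Num*;

// Private helper rule: used by Program, but no NumParser is emitted.
Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```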

Marwes commented 5 years ago

@tanin47 If multiple pub parsers are needed it is still possible to coax LALRPOP into generating one parser for all of them using the approach in https://github.com/lalrpop/lalrpop/pull/414.

tanin47 commented 5 years ago

@Marwes could you elaborate on how to achieve this? I'm not sure I understand how conditional compilation should be used.