timglabisch closed this issue 6 years ago.
To be clear, is your concern compilation time (i.e., cargo build) or execution time (i.e., time to parse a PHP file)? I've definitely noticed that the recursive ascent algorithm that LALRPOP uses generates quite a lot of code, which correspondingly takes quite a lot of time to compile. You can help things by switching to the LALR(1) algorithm:
grammar["LALR(1)"];
I have been wanting to implement some other improvements as well (e.g., generalized left corner parsing or other algorithms beyond LALR(1)). Another option is to add a second table-driven mode, but that will cause the parser itself to run more slowly.
Sorry for not describing this clearly; my concern is just compilation time.
@timglabisch ok, then yes it's known -- I'm hoping to put in some effort to fix it in the near term, since I want to get a LALRPOP version of the Rust grammar up and going too (and it's going to have the same problem).
I modified the issue title to be more precise.
This was a sore point for me as well when I tried LALRPOP. Even my small grammar (C-ish expressions) produced a very large .rs file, mostly comments, that took over a minute to compile, IIRC. At the time, it seemed to me that the output size was increasing exponentially with respect to the grammar size, so I discontinued use of LALRPOP for the time being.
I'd like to see this addressed if possible, because I'm one of those crazy people who feel more comfortable with LALR(1) than LL(1). Also, I like LALRPOP's grammar syntax and features.
I think probably the easiest fix to start is to make a table-driven parser mode, which ought to be much more compressed. I'll probably try to hack that up in the near future. Should be relatively easy.
I'd prefer to do some of the fancier work to make recursive ascent more compact, but that'll be harder. :)
From my point of view it would help a lot if there were some kind of "fast" mode for development and a slower one (maybe taking up to hours) for the real build.
@nikomatsakis It seems like lalrpop::lr1::interpret basically provides a table-driven parser mode, right? At the moment it's just used for testing and only generates a ParseTree, but shouldn't it be fairly straightforward to integrate calling the action functions on reduce?
If that's the case, wouldn't we be able to emit a table driven parser by dumping the table and some additional code that invokes the modified interpreter?
I'm probably missing something, but if the general idea makes sense I could look into that in the following days.
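For what it's worth, the general shape of such a backend can be sketched in a few lines. Everything below is hypothetical and hand-written for a toy grammar (S -> "a" "b"); the real tables would be dumped by the build step, and the driver would also invoke the user's action functions on each reduce:

```rust
// Hypothetical sketch of what a table-driven backend might emit:
// constant ACTION/GOTO tables plus a small driver loop that replays them.

#[derive(Clone, Copy)]
enum Action {
    Shift(usize),  // push this state
    Reduce(usize), // apply this rule
    Accept,
    Error,
}

// Terminals: 0 = "a", 1 = "b", 2 = EOF.
const ACTION: [[Action; 3]; 4] = [
    [Action::Shift(2), Action::Error, Action::Error],  // state 0
    [Action::Error, Action::Error, Action::Accept],    // state 1
    [Action::Error, Action::Shift(3), Action::Error],  // state 2
    [Action::Error, Action::Error, Action::Reduce(0)], // state 3
];

// GOTO[state][nonterminal]; only the nonterminal S (index 0) here.
const GOTO: [[usize; 1]; 4] = [[1], [0], [0], [0]];

// (lhs nonterminal index, number of symbols popped) per rule.
const RULES: [(usize, usize); 1] = [(0, 2)];

fn parse(tokens: &[usize]) -> bool {
    let mut states = vec![0usize];
    let mut pos = 0;
    loop {
        let tok = tokens.get(pos).copied().unwrap_or(2); // 2 = EOF
        match ACTION[*states.last().unwrap()][tok] {
            Action::Shift(s) => {
                states.push(s);
                pos += 1;
            }
            Action::Reduce(r) => {
                // Pop the rule's right-hand side, then follow GOTO.
                let (lhs, len) = RULES[r];
                states.truncate(states.len() - len);
                let top = *states.last().unwrap();
                states.push(GOTO[top][lhs]);
            }
            Action::Accept => return true,
            Action::Error => return false,
        }
    }
}
```

The interesting question is exactly the one raised above: hooking the user's action code into the Reduce arm.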
@fhahn this is roughly correct; the main catch is that you would have to write something that generates the code to run the interpreter, since the interpreter currently runs at build.rs time. Something like emit_recursive_ascent. I'd love some help with this; if you have the time, I'm happy to answer any questions. It should be relatively straightforward, I think. The main thing is probably that there will be some refactoring to do, since this will be the first time we write a second backend (and we have to decide how to configure it).
Great, I'll look into it over the next couple of days and will ping you when I've made some progress (or gotten terribly stuck).
Has there been any progress on this? I'm working on a parser for a scripting language, and progress has completely stopped because generating and compiling syntax.rs takes roughly 6 minutes.
I have a 207-line syntax.lalrpop file that becomes a 931,410-line syntax.rs. Granted, much of that is comments, but rustc can't seem to handle it.
As far as I know there has been no progress -- @fhahn, any news? I've been working hard on error reporting, but that's coming to a close, so after that I'll turn to this as next top priority (unless I hear otherwise from @fhahn).
@TyOverby two questions:

1. Are you using grammar["LALR(1)"];? This may help because it reduces the number of states, though it can cause spurious conflicts (at some point I want to improve the LR(1) generator to use the LALR(1) state set so long as no conflicts would be reported, but doing that right is a more complex endeavor).
2. If you run cargo build, then cargo builds and executes your build.rs file in debug mode, which is a big problem for LALRPOP. cargo build --release, in contrast, will run LALRPOP in release mode, which is WAY faster. So you might try doing a release build and see what happens!

@nikomatsakis I started looking into it but did not have too much time over the last week. I think we should decide how to emit the parse table before continuing. As far as I can tell we could either emit a serialized version which is deserialized when the parser is started, or emit code which builds up the table explicitly. Which would you prefer? I could probably work on this next weekend.
@nikomatsakis I am already using LALR(1); I had it on LR(1) until a build took half an hour to complete.

Thanks for the cargo build --release advice. Here are timings for various scenarios:
cargo build
- After touching syntax.lalrpop: 7m17s (lalrpop: 6m30s, rustc: 47s)
- After touching lib.rs: 47s

cargo build --release
- After touching syntax.lalrpop: 2m18s (lalrpop: 33s, rustc: 1m45s)
- After touching lib.rs: 1m45s
I'm a huge fan of debug builds, so clearly we need a way to tell cargo to use release versions of build.rs scripts during a debug build.
In the meantime, I could just use the command-line version of lalrpop compiled in release mode.
47s is much nicer, but it would be great if codegen produced something that rustc could compile faster.
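Incidentally, later versions of Cargo (1.41 onward) grew support for exactly this via profile overrides; assuming such a toolchain, you can ask for optimized build scripts even in a debug build:

```toml
# In Cargo.toml: optimize build.rs and its dependencies
# (such as lalrpop) even when the main build is a debug build.
[profile.dev.build-override]
opt-level = 3
```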
@fhahn If you want to serialize and deserialize large structures efficiently, check out bincode. With #[derive(RustcEncodable, RustcDecodable)] you get a structure that can be serialized and deserialized very quickly and with zero size overhead.
@fhahn I would expect to generate a large constant array, but perhaps that too would have large compile times. It'd have the advantage of being faster to parse (no startup deserialization cost). I guess the real answer is that we ought to experiment! In any case, how we generate the table seems like something we can tweak. I figure you'll have twin stacks: one for the states, and one holding a Data enum that wraps up all the possible nonterminals and tokens. Then we'll just have to generate glue code for invoking actions that does a match to downcast appropriately.
That sounds good. Some parts probably cannot be created as consts (like the maps from nonterminals to states), but we should be able to generate them at the beginning of main.
I think as the first step I should start with creating data structures for states and data that can be stored in constant arrays and then continue with code that emits those constant arrays. Then we can adopt the interpreter code to work with the new table layout.
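As a rough illustration of the twin-stack glue being discussed here (all names hypothetical; the real generator would emit one such function per production):

```rust
// Hypothetical twin-stack glue: `Data` wraps every terminal and
// nonterminal value, and each reduce action pops its arguments,
// matches to downcast them, and pushes the result back.

enum Data {
    TermNum(i32), // a number token carrying its value
    TermPlus,     // the "+" token
    NtExpr(i32),  // the Expr nonterminal's computed value
}

// Generated action glue for a rule like `Expr = Expr "+" Num`.
fn reduce_expr_plus_num(data: &mut Vec<Data>) {
    // Pop the three right-hand-side symbols, newest first.
    let num = data.pop().unwrap();
    let _plus = data.pop().unwrap();
    let expr = data.pop().unwrap();
    // Downcast via match; the parse table guarantees the stack shape.
    match (expr, num) {
        (Data::NtExpr(l), Data::TermNum(r)) => data.push(Data::NtExpr(l + r)),
        _ => unreachable!("parse table guarantees the stack shape"),
    }
}
```

A parallel stack of state numbers (not shown) would drive which glue function gets called.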
I also ran into this problem with my grammar, especially when I was using the integrated LALRPOP lexer. At that time the generated parser source file was about 17 MB. Now that I have switched to a custom lexer, the size is about 0.5 MB, which is much better.
So it seems the way the lexer is generated adds a lot of weight.
@joris-r interesting, I hadn't really looked at the code generation due to the lexer (all of my big examples wind up with a custom lexer anyhow)
> I'm a huge fan of debug builds, so clearly we need a way to tell cargo to use release versions of build.rs scripts during a debug build.
Is there any progress on that? I can't find an issue in the cargo
repo. Maybe we should request that feature?
We probably ought to request the feature. I had planned a workaround where you can cargo install lalrpop and it will use the executable from your path, but that is really suboptimal.
I thought I already made one, but I guess I didn't. I'll open up an issue now.
In the meantime, you can put this in your build.rs:

```rust
extern crate lalrpop;

use std::process::Command;

fn main() {
    // Try the (release-built) command-line lalrpop first...
    let status = Command::new("lalrpop").arg("src/grammar.lalrpop").status();
    // ...and fall back to the library if the binary is missing or fails.
    if !status.map(|s| s.success()).unwrap_or(false) {
        lalrpop::process_root().unwrap();
    }
}
```
edit: added the fallback
@pothos Or better yet: have it use the command line tool and fall back to using the typical library approach.
I was very excited when I discovered LALRPOP, mostly because of the awesome error messages. Unfortunately, even with a 148-line grammar it's basically unusable because of crazy compile times. Is there any ongoing work to improve this?
The weird thing is that even if I don't change anything in my .lalrpop file, recompilation takes a huge amount of time. Any ideas about this? I'm done with parsing, but unless I make a separate crate just for the parser, the whole project gets unbearably slow.
I find the crate for the parser technique to be pretty nice actually. I have my project split effectively into syntax, and semantics. But yea, faster compiles would be nice either way.
You can use sed to remove the comments from the generated file (assuming that there are no nested /* comments), or have build.rs do this.
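If doing it from build.rs, a small post-processing pass along these lines would work (a sketch; it assumes, as above, that there are no nested /* */ comments and no comment markers inside string literals):

```rust
// Strip non-nested /* ... */ block comments from generated source.
// Assumes no comment markers inside string literals.
fn strip_block_comments(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    while let Some(start) = rest.find("/*") {
        // Keep everything up to the comment opener.
        out.push_str(&rest[..start]);
        match rest[start + 2..].find("*/") {
            // Skip past the matching closer and continue scanning.
            Some(end) => rest = &rest[start + 2 + end + 2..],
            // Unterminated comment: drop the tail.
            None => return out,
        }
    }
    out.push_str(rest);
    out
}
```

Run over the generated .rs file after lalrpop::process_root() and write the result back.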
https://github.com/nikomatsakis/lalrpop/pull/122 now offers a table-driven code generation style which I hope will help address these issues. I haven't had time to do much in the way of comparisons though. If someone is able to give it a spin and report back, I'd be interested to hear about your experiences! :)
Just as a data point, measuring only the final call to rustc, the Pascal demo in LALRPOP takes the following timings:

| rustc timings | Debug | Release |
|---|---|---|
| Table-driven | 10s | 17s |
| Ascent | 67s | 182s |
Note that we are still generating the lexer internally; this could probably be converted to a table-driven style as well for further gains.
(Note: with the current syntax, that's #[LALR] grammar and not grammar["LALR(1)"].)

Could we get a cargo release with #122?
@TyOverby yeah, I had just hoped to get some feedback that it works for someone besides me. =)
Just FYI, some measurements with pascal.rs suggest recursive ascent is about 30% faster. Of course it will depend on a lot of factors.
Somehow I can't compile anymore now. It says
src/grammar.lalrpop:9:1: 9:6 error: unexpected token: `extern`
extern {
referring to my own Tok data structure, which I am using:
use tok::Tok;
extern {
enum Tok {
"(" => Tok::LPAR{location: <String>, value: <String>},
…
edit: solved; the grammar; declaration was missing
Link to the source? (Actually, could you open a separate issue?)
Thanks for the patch, @nikomatsakis. I'm going to try it sometime today and report back.
I still have to stick to
#[LALR]
#[recursive_ascent] grammar;
which yields a 19MB file.
Otherwise I get an OOM with rustc, or something worse, because
grammar;
results in a file size of 190MB
and just
#[LALR] grammar;
results in a file size of 57MB and later throws
this function takes 1 parameter but 2 parameters were supplied [E0061]
src/grammar.rs:368736 Tok::STRING { location: __tok0, value: __tok1 } => __Symbol::TermSTRING(__tok0, __tok1),
New version published. @pothos, I'm not sure what's up with the OOM -- I guess rustc isn't coping well with a big array. The 190MB file is probably largely comments, would be my guess, but I'm not sure. I can maybe try to investigate.
Should LALRPOP automatically emit comments? I think it should not, unless told to (for debugging LALRPOP).
I finally tried this. With grammar; I get a 51,383-line Rust file; with #[LALR] grammar; I get a 40,508-line one. Either way it's huge, IMHO. Compile times still suffer.
Now that we're using the new lexer, we should be generating a lot less code as well. I'd be curious to see some new measurements, though I probably need to do a new public release first.
Well, I'm not sure exactly how the new lexer is better, but I can say it's much better than one year ago! At that time (LALRPOP 0.10.0, I think) I had something like 0.5 MB with LALR and a custom lexer; now I have 0.18 MB for the generated file using the generated lexer. And with LR I had 4.1 MB, and now 0.64 MB, so that's quite a bit better as well.
@joris-r great :)
Going to close this; I think we've made a lot of progress here.
I'm super late to the party, but I want to add what I've found: avoid using pub frivolously, because each pub nonterminal gets its own generated parser. I reduced the number of pub nonterminals from 20 to 2, and the generated file went from ~2M lines to ~20K.
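One pattern that can keep the pub count down (a hypothetical sketch, and not necessarily the mechanism used in #414): route several entry points through a single pub nonterminal keyed by a synthetic marker token, so only one parser is generated. All names here (Ast, Expr, Stmt, the marker tokens) are illustrative:

```
grammar;

// Single public entry point; callers prepend a marker token
// to select which sub-grammar they want parsed.
pub Entry: Ast = {
    "#expr" <e:Expr> => e,
    "#stmt" <s:Stmt> => s,
};

Expr: Ast = { /* ... */ };
Stmt: Ast = { /* ... */ };
```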
@tanin47 If multiple pub parsers are needed, it is still possible to coax LALRPOP into generating one parser for all of them using the approach in https://github.com/lalrpop/lalrpop/pull/414.
@Marwes could you elaborate more how to achieve this? I'm not sure I understand how conditional compilation should be used.
Hello,

I'm trying to parse PHP expressions using LALRPOP: https://github.com/timglabisch/rustphp/blob/c79060d6495a55174fc2ad5710d5774d8ec94d67/src/calculator1.lalrpop

My problem is that cargo run becomes incredibly slow:

time cargo run  1234.76s user 19.88s system 99% cpu 20:58.78 total

The generated file is very large (~50 MB):

cat src/calculator1.rs | wc -l
1044621

Is there something fundamentally wrong, or is this expected?