Closed Frizi closed 9 years ago
I guess it will be really hard to change the performance characteristics without fundamentally rewriting the regex part. As @jsoverson has mentioned a few times, we will have to implement a parser/tokenizer so that this part imposes less of a penalty on processing.
I'd say don't implement a parser, but rather define a language and use something like PEG.js. But YMMV.
Now PEG.js is interesting. Definitely worth a look.
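To make the idea concrete, here is a minimal sketch of what a PEG.js grammar for conditional blocks could look like. The `// @if` / `// @endif` directive syntax and the rule names are assumptions for illustration only, not the actual preprocess syntax:

```pegjs
// Hypothetical grammar sketch: a document is a sequence of plain text
// runs and @if/@endif blocks.
Start
  = blocks:Block* { return blocks; }

Block
  = IfBlock
  / Text

IfBlock
  = "// @if " cond:Line body:Text "// @endif" EOL?
    { return { type: "if", condition: cond, body: body }; }

// Any run of characters that does not start a directive.
Text
  = chars:(!"// @" c:. { return c; })+
    { return { type: "text", value: chars.join("") }; }

// The rest of the current line, without the trailing newline.
Line
  = chars:[^\n]* "\n" { return chars.join(""); }

EOL
  = "\n"
```

PEG.js compiles such a grammar into a standalone parser, so the regex matching per directive would be replaced by a single parse pass per file.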
I would be willing to help with a migration to a proper grammar parsing engine. This seems to me to be the direction that the tool should be heading in, as it continues to gain more rules and features. I would suggest that we would gain 2 major benefits from moving to a tool such as peg.js -
Hopefully this makes sense to everyone, and I would definitely be interested in this being the future direction of preprocess.
@orionstein, since you have commit access, you can start implementing this if you want to. I would also find it very interesting. I've never implemented a parser/tokenizer before, nor have I used PEG.js to define a language grammar. I'd like to participate in this. I think that the best way to start collaborating is to set up a milestone and create tasks that each of us can assign to ourselves, to coordinate the implementation in a branch.
The problem is that before we can verify that we're actually improving things, we first need to set up some performance tests to measure our progress, as @Frizi suggested. We therefore have to either build them in nodeunit or port the suite to another test runner (like mocha).
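As a starting point, a perf test doesn't need much infrastructure; something like the following timing helper would do. This is just a sketch: `preprocessAll` here is a hypothetical placeholder for running preprocess over real fixture files, not an existing function in the lib.

```javascript
// Rough timing helper for perf measurements.
function timeRun(fn) {
  const start = process.hrtime.bigint();
  fn();
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6; // elapsed time in milliseconds
}

// Placeholder workload; in the real suite this would invoke preprocess
// over a directory of fixture files.
function preprocessAll() {
  let out = "";
  for (let i = 0; i < 1000; i++) {
    out += "// line " + i + "\n";
  }
  return out;
}

const ms = timeRun(preprocessAll);
console.log("run took " + ms.toFixed(2) + " ms");
```

Running the same harness against both the regex and the grammar branch would give us the comparison numbers we're missing.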
@BendingBender, I'll start putting it together in my fork - I'm not sure, however, that I would want to commit a version here that had some elements implemented in PEG and some implemented in regex. I think a milestone or feature/directive approach would work well, but it would be a matter of figuring out how to collaborate while maintaining the integrity of the current regex system until the new one is deemed stable enough to move over to. A larger-scale performance test would be good to write in any case: it will let us compare the grammar and regex systems, and if we decide to stick with a regex-based system, it will help us identify and optimize problematic code moving forward.
I was suggesting working on a branch so that each of us can contribute and see the current progress. This would prevent situations where the same feature is implemented twice by two different devs, and it wouldn't interfere with master. But do as you like; it's your time investment.
Writing a real AST generator with PEG could also help a lot with #59. If you are implementing this already, preprocess should be able to output the generated AST next to the actual output. This could be very helpful for external tools (I'm looking at you, wallabyjs).
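For example, an emitted AST could look something like this. The node types, field names, and the `walk` helper are all assumptions sketched for discussion, not an existing preprocess API:

```javascript
// Hypothetical AST shape that preprocess could emit alongside its output.
const ast = {
  type: "Document",
  children: [
    { type: "Text", value: "var a = 1;\n", start: 0, end: 11 },
    {
      type: "IfDirective",
      condition: "NODE_ENV === 'production'",
      start: 11,
      end: 78,
      children: [
        { type: "Text", value: "var debug = false;\n", start: 45, end: 64 }
      ]
    }
  ]
};

// An external tool (e.g. a source-map generator) could walk the tree:
function walk(node, visit) {
  visit(node);
  (node.children || []).forEach(child => walk(child, visit));
}

let count = 0;
walk(ast, () => count++);
console.log(count + " nodes"); // visits all 4 nodes
```

With `start`/`end` offsets on every node, tools like wallabyjs could map the processed output back to the original source.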
@Frizi, did you observe the build duration regression as a consequence of recent changes to this lib, or has it been there since you introduced preprocess in your project?
Hard to say, as I didn't have many files preprocessed before. Also, after trying to process everything, I added big libraries to the mix. After blacklisting them, it got down to a few seconds. That means measurements are needed.
Added performance measuring for the test suite run in #71. Not sure whether this will suffice, but at least we can compare performance during development of one version. Maybe I should add another perf measurement test, but not now.
After using this tool on a project of reasonable size, my build times skyrocketed from ~20s to ~1m 20s, which means it adds more processing than all the other steps combined. And there are a lot of them (pretty much every single file gets analysed at least once, and most of them are compiled). There is almost no actual processing being done currently, so most of the overhead is spent on pattern matching. We already have unit tests here, but every suite uses only a minimal snippet of code.
A new test suite should be created to test performance on real-sized files and to allow comparisons between versions. After that, performance optimizations can be made with more confidence.