no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

Performance and benchmarks #73

Closed. ghost closed this issue 6 years ago.

ghost commented 7 years ago

Moo is the fastest JS tokenizer around. It's ~2–10x faster than most other tokenizers; it's a couple orders of magnitude faster than some of the slower ones.

Really? In my experience so far, moo is significantly slower than ANTLR's lexer, probably because regexps aren't necessarily fast to begin with.

ANTLR 4.7's generated JS lexer is about 40% faster and uses less memory than moo. The input grammar is CSV, with about 2 MB of data. I'll post the code and input file when I get around to it; this is just a heads-up for now.
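
For context, a moo lexer for this kind of input would look roughly like the sketch below. This is an assumed minimal CSV tokenizer, not the reporter's actual code; the token names and patterns are guesses.

```js
// Minimal sketch (assumed, not the reporter's code) of a moo-based CSV tokenizer,
// just to show what is being measured on the moo side.
const moo = require('moo')

const lexer = moo.compile({
  comma: ',',
  newline: { match: /\r?\n/, lineBreaks: true }, // lineBreaks keeps line/col info correct
  field: /[^,\r\n]+/,
})

// Tokenize a whole input string and count tokens, so the work isn't optimised away.
function countTokens(input) {
  lexer.reset(input)
  let count = 0
  while (lexer.next() !== undefined) count++
  return count
}

// Example timing run on a large file:
// const fs = require('fs')
// const input = fs.readFileSync('data.csv', 'utf8')
// console.time('moo')
// console.log(countTokens(input), 'tokens')
// console.timeEnd('moo')
```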

bd82 commented 7 years ago

What engine did you test this on? (Browser/Node version?)

ghost commented 7 years ago

@bd82 latest Node LTS (6.11.2)

bd82 commented 7 years ago

Would be interesting to check on latest Node 8.4 with a more modern V8. https://v8project.blogspot.co.il/2017/01/speeding-up-v8-regular-expressions.html

Reproducing.

I've tried to reproduce your results by modifying an existing JS parser benchmark I've authored, but without success. Using a JSON grammar and a 1,000-line sample, I see Moo being faster than ANTLR.
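
A head-to-head timing can be wired up with benchmark.js along these lines. This is only a hedged sketch: `tokenizeWithMoo`, `tokenizeWithAntlr`, and `sampleInput` are placeholders, not code from the linked benchmark.

```js
// Hedged sketch of a head-to-head timing harness using benchmark.js.
// tokenizeWithMoo / tokenizeWithAntlr are placeholder wrappers around the two
// lexers (not functions from the linked benchmark page), and sampleInput
// stands in for the JSON sample text.
const Benchmark = require('benchmark')

const sampleInput = '{"hello": "world"}'                       // placeholder sample
const tokenizeWithMoo = (text) => { /* run the moo lexer over text */ }
const tokenizeWithAntlr = (text) => { /* run the generated ANTLR lexer over text */ }

new Benchmark.Suite()
  .add('moo', () => tokenizeWithMoo(sampleInput))
  .add('antlr', () => tokenizeWithAntlr(sampleInput))
  .on('cycle', (event) => console.log(String(event.target)))
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run()
```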

On the original statement.

I agree that the original statement is not very accurate and, more importantly, impossible to prove under all possible conditions (different JS engines / different grammars).

Personally I would use a more general statement such as:

Moo is very/incredibly/super-duper fast, often beating competitors by multiples or even orders of magnitude.

And link to some online benchmark that supports the claim. But it is up to @tjvr to decide where the line between marketing and precise accuracy falls for this project.

tjvr commented 7 years ago

@bd82 Thanks for looking into this!

I’ll look into adding antlr to the benchmarks Moo uses, at some point. :-)

bd82 commented 7 years ago

I’ll look into adding antlr to the benchmarks Moo uses, at some point. :-)

It could be a little annoying to support, because a Java application is used to generate the lexer and that jar is not available on npm (AFAIK).

ghost commented 7 years ago

@bd82 you must've used a bad ANTLR grammar. Also, the problem occurs with large input (2 MB in my case), so I assume memory thrashing (GC pressure) is the main cause?

ghost commented 7 years ago

Would be interesting to check on latest Node 8.4 with a more modern V8.

I have to use LTS.

nathan commented 7 years ago

@notsonotso

you must've used a bad ANTLR grammar. Also, the problem occurs with large input (2 MB in my case), so I assume memory thrashing (GC pressure) is the main cause?

Please provide a reproducible test case (generated antlr lexer, moo lexer, and input file), or this discussion won't go anywhere useful. GitHub Gist is good for this.

bd82 commented 7 years ago

@notsonotso

you must've used a bad ANTLR grammar. Also, the problem occurs with large input (2 MB in my case), so I assume memory thrashing (GC pressure) is the main cause?

You can inspect the grammar I've used. It is originally from ANTLR's example grammar repository. https://github.com/SAP/chevrotain/blob/gh-pages/performance/jsonParsers/antlr/JSON_ANTLR.g4

You can examine it and reproduce the benchmark locally if you check out the commit I linked above: just check out the commit/repo and open the performance/index.html page in a local browser.

It could be something related to very large files, or it could be that some token patterns in CSV are more suitable to be lexed using an ANTLR-generated state machine instead of a RegExp, or that the inverse is true in the case of JSON tokens.
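
For reference, the single-sticky-RegExp approach the README alludes to ("Uses /y for performance") can be sketched conceptually like this. It is not moo's actual internals, just an illustration of the technique being contrasted with a generated state machine.

```js
// Conceptual sketch: one combined /y (sticky) regex is advanced through the
// input, and the matching capture group identifies the token type.
const combined = /(,)|(\r?\n)|([^,\r\n]+)/y
const names = ['comma', 'newline', 'field']

function* tokens(input) {
  combined.lastIndex = 0
  let m
  while ((m = combined.exec(input)) !== null) {
    // Find which alternative matched; a real lexer would report an error
    // when no alternative matches at the current position.
    const groupIndex = m.slice(1).findIndex((g) => g !== undefined)
    yield { type: names[groupIndex], value: m[0] }
  }
}

// for (const tok of tokens('a,b\nc,d\n')) console.log(tok)
```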

As @nathan said, once you have a reproducible example we can look into this more deeply.

I have to use LTS.

Try NVM; it works on Mac/Linux and lets you rapidly switch Node.js versions and have multiple Node versions installed at the same time.

ghost commented 7 years ago

It could be something related to very large files, or it could be that some token patterns in CSV are more suitable to be lexed using an ANTLR-generated state machine

This sounds like a reasonable explanation.

Why don't you try with a 2 MB file in your test code?

ghost commented 7 years ago

Try NVM; it works on Mac/Linux and lets you rapidly switch Node.js versions and have multiple Node versions installed at the same time.

Why? I'll never be able to use anything other than LTS in production, so I don't see the point.

bd82 commented 7 years ago

Why? I'll never be able to use anything other than LTS in production, so I don't see the point.

I fear that is a bit too simplified.

Why don't you try with a 2 MB file in your test code?

I will try that.

ghost commented 7 years ago

@bd82 I tried Node 8.4.0 with the same result :( Each line of my CSV input is about 500-600 characters... I wonder if some regexp is blowing up (while looking for EOL?)
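
One way to narrow this down is to lex the same data once as a whole file and once line by line, to see whether the cost is concentrated in the long lines. This is a hedged sketch with an assumed lexer definition and a placeholder file path, not the actual benchmark code.

```js
// Diagnostic sketch: compare whole-file lexing vs per-line lexing on the same data.
const fs = require('fs')
const moo = require('moo')

// Assumed CSV token rules; substitute the real lexer definition.
const lexer = moo.compile({
  comma: ',',
  newline: { match: /\r?\n/, lineBreaks: true },
  field: /[^,\r\n]+/,
})

// Pull every token without keeping them, so only lexing time is measured.
function drain(input) {
  lexer.reset(input)
  while (lexer.next() !== undefined) {}
}

const input = fs.readFileSync('data.csv', 'utf8') // placeholder path

console.time('whole file')
drain(input)
console.timeEnd('whole file')

console.time('line by line')
for (const line of input.split('\n')) drain(line)
console.timeEnd('line by line')
```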

tjvr commented 7 years ago

Unless you show your code, we can only speculate. As Nathan said, this is going nowhere.

tjvr commented 6 years ago

Closing, since the discussion didn't lead anywhere.