jquery / esprima

ECMAScript parsing infrastructure for multipurpose analysis
http://esprima.org
BSD 2-Clause "Simplified" License

Investigate performance problem of the tokenizer #1019

Closed: ariya closed this issue 9 years ago

ariya commented 9 years ago

Migrated from https://code.google.com/p/esprima/issues/detail?id=626.

As reported by @aslushnikov, it seems that tokenize is much slower than parse.

ariya commented 9 years ago

Here are some numbers from the current master branch and Node.js 0.12.

Running node test/benchmarks.js for parse():

    Underscore 1.5.2    42.5 KiB      6.21 ms ± 2.07%
      Backbone 1.1.0    58.7 KiB      6.04 ms ± 0.81%
      MooTools 1.4.5   156.7 KiB     30.41 ms ± 0.94%
        jQuery 1.9.1   262.1 KiB     32.15 ms ± 0.96%
          YUI 3.12.0   330.4 KiB     29.39 ms ± 0.80%
 jQuery.Mobile 1.4.2   442.2 KiB    109.21 ms ± 9.28%
       Angular 1.2.5   701.7 KiB     74.80 ms ± 6.59%
                     ------------------------
                      1994.3 KiB    288.21 ms ± 6.65%

To check the tokenizer, apply the following patch:

diff --git a/test/benchmarks.js b/test/benchmarks.js
index cf7673f..d791c20 100644
--- a/test/benchmarks.js
+++ b/test/benchmarks.js
@@ -322,8 +322,8 @@ if (typeof window !== 'undefined') {
                     size = source.length;
                 totalSize += size;
                 return suite.add(filename, function () {
-                    var syntax = esprima.parse(source, { range: true, loc: true });
-                    tree.push(syntax.body.length);
+                    var tokens = esprima.tokenize(source, { range: true, loc: true });
+                    tree.push(tokens.length);
                 }, {
                     'onComplete': function (event, bench) {
                         if (typeof gc === 'function') {

Running node test/benchmarks.js now shows the numbers for tokenize():

    Underscore 1.5.2    42.5 KiB      4.93 ms ± 0.85%
      Backbone 1.1.0    58.7 KiB      5.35 ms ± 0.66%
      MooTools 1.4.5   156.7 KiB     53.27 ms ± 6.47%
        jQuery 1.9.1   262.1 KiB     64.54 ms ± 6.52%
          YUI 3.12.0   330.4 KiB     48.70 ms ± 5.46%
 jQuery.Mobile 1.4.2   442.2 KiB    103.97 ms ± 8.11%
       Angular 1.2.5   701.7 KiB     77.81 ms ± 7.07%
                     ------------------------
                      1994.3 KiB    358.56 ms ± 6.92%

ariya commented 9 years ago

@aslushnikov @ikarienator Can you double check with your own setup/environment?

ikarienator commented 9 years ago

I already knew there was a performance problem with the tokenize function, caused by the reconstruction of the output tokens. In fact, the difference is even more pronounced if you turn off range and loc.

ariya commented 9 years ago

Now with range and loc both set to false.

For parse():

    Underscore 1.5.2    42.5 KiB      6.28 ms ± 1.77%
      Backbone 1.1.0    58.7 KiB      6.15 ms ± 0.76%
      MooTools 1.4.5   156.7 KiB     31.12 ms ± 0.91%
        jQuery 1.9.1   262.1 KiB     32.82 ms ± 0.86%
          YUI 3.12.0   330.4 KiB     29.60 ms ± 0.74%
 jQuery.Mobile 1.4.2   442.2 KiB    110.85 ms ± 9.35%
       Angular 1.2.5   701.7 KiB     77.49 ms ± 7.43%
                     ------------------------
                      1994.3 KiB    294.31 ms ± 6.91%

For tokenize():

    Underscore 1.5.2    42.5 KiB      4.11 ms ± 0.81%
      Backbone 1.1.0    58.7 KiB      4.44 ms ± 0.73%
      MooTools 1.4.5   156.7 KiB     47.27 ms ± 7.52%
        jQuery 1.9.1   262.1 KiB     54.11 ms ± 4.48%
          YUI 3.12.0   330.4 KiB     42.64 ms ± 6.67%
 jQuery.Mobile 1.4.2   442.2 KiB     92.78 ms ± 6.55%
       Angular 1.2.5   701.7 KiB     69.94 ms ± 4.96%
                     ------------------------
                      1994.3 KiB    315.29 ms ± 6.00%

aslushnikov commented 9 years ago

@ariya,

My real-world example is the tokenization of one of the Inbox client scripts: roughly 580k tokens of obfuscated JavaScript. Let's build a sample of that size with the following command:

cd test/3rdparty
cat underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js angular-1.2.5.js > big.js

The command produces a big.js file whose tokenization yields 578335 tokens. I've modified benchmarks.js to include big.js and to tokenize instead of parse; here are my results for tip-of-tree Esprima:

    Underscore 1.5.2    42.5 KiB      4.40 ms ± 4.99%
      Backbone 1.1.0    58.7 KiB      5.13 ms ± 1.35%
      MooTools 1.4.5   156.7 KiB     43.01 ms ± 5.09%
        jQuery 1.9.1   262.1 KiB     58.65 ms ± 7.80%
          YUI 3.12.0   330.4 KiB     40.88 ms ± 3.26%
 jQuery.Mobile 1.4.2   442.2 KiB     95.26 ms ± 8.17%
       Angular 1.2.5   701.7 KiB     89.68 ms ± 4.87%
                 big  4690.4 KiB   1152.83 ms ± 10.79%
                     ------------------------
                      6684.7 KiB   1489.86 ms ± 9.96%

We've spent more than a second tokenizing 4.5 MB of code. That might not look outstandingly bad, but in fact we can do much better.

To show that the performance here is suboptimal, let me run the same benchmark with a modified Esprima branch, taken from https://github.com/aslushnikov/esprima/tree/sax. Instead of pushing tokens into an array, the modified branch reports them via a callback. This saves on memory reallocations (I'm not a V8 expert, just guessing) and results in significant performance gains. These are the results for the modified Esprima:

    Underscore 1.5.2    42.5 KiB      4.51 ms ± 2.09%
      Backbone 1.1.0    58.7 KiB      4.68 ms ± 1.35%
      MooTools 1.4.5   156.7 KiB     20.17 ms ± 2.10%
        jQuery 1.9.1   262.1 KiB     25.25 ms ± 1.45%
          YUI 3.12.0   330.4 KiB     20.73 ms ± 1.34%
 jQuery.Mobile 1.4.2   442.2 KiB     42.14 ms ± 2.26%
       Angular 1.2.5   701.7 KiB     35.53 ms ± 2.24%
                 big  4690.4 KiB    324.30 ms ± 7.09%
                     ------------------------
                      6684.7 KiB    477.32 ms ± 5.95%

That's 3.5 times faster for big.js.

Update: the diff for the modified benchmarks.js can be found here: https://gist.github.com/aslushnikov/fbd25ec7a90e8f87d526

ariya commented 9 years ago

@aslushnikov I agree that the tokenizer can still be optimized. At this stage, we are trying to collect as much information as possible before making any modifications.

In the benchmarks.js that you show, do I understand correctly that the callback is just an empty function? Does the benchmark still produce and save the list of tokens somewhere?

ikarienator commented 9 years ago

A better way to benchmark would be to clear the list in the onComplete callback and call gc() immediately. Otherwise, garbage generated in previous runs gets collected during subsequent runs, which increases the variation. I'll send a pull request to improve this.
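
A minimal sketch of that suggestion, based on the onComplete hook visible in the benchmark diffs above (the actual pull request may differ):

'onComplete': function (event, bench) {
    // Drop references to the tokens produced in this run so they cannot
    // survive into the next run's measurements.
    tree.length = 0;
    if (typeof gc === 'function') {
        gc(); // collect now rather than midway through the next run
    }
}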

aslushnikov commented 9 years ago

@ariya Yep, that's just an empty function. If clients want the list of tokens, they can collect them into an array on their own. But for some problems a streaming interface is enough, and it works much faster.

ariya commented 9 years ago

@aslushnikov But that means the benchmark numbers you posted are not a fair comparison. If the callback also adds each token to an array, what are the numbers?

aslushnikov commented 9 years ago

@ariya, this is not a comparison of "how much time does it take to return an array of tokens to the client". What I want to compare is "how much time does it take to report tokens to the client".

My point is that there are clients of esprima.tokenize that do not need the tokens stored in an array; they would be just fine with the tokens being streamed. Consider an example where I just want to count the for loops in some source code, as sketched below. I don't need an array of tokens, so why should my performance suffer?
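
As a concrete sketch of that use case, assuming a hypothetical callback argument to tokenize in the spirit of the sax branch above (tip-of-tree Esprima at the time only returned an array):

// Count `for` loops without materializing the full token array.
// NOTE: the three-argument tokenize() used here is hypothetical.
function countForLoops(source) {
    var count = 0;
    esprima.tokenize(source, {}, function (token) {
        if (token.type === 'Keyword' && token.value === 'for') {
            count += 1;
        }
    });
    return count;
}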

ikarienator commented 9 years ago

Unfortunately, with the current architecture we aren't able to provide a stream of tokens, because the tokenizer does not support reentrance. But it will become possible once we reorganize the code and expose the tokenizer and parser as objects.

ariya commented 9 years ago

@aslushnikov Sounds like a perfectly valid use case. Just to be clear, I'd like to understand which approach would be more suitable for you for this particular use case.

aslushnikov commented 9 years ago

@ariya As long as parsing a few megabytes of JS performs well, I'm all good! And of course I would prefer the Esprima API not to change, if that is even possible.

aslushnikov commented 9 years ago

Any updates on this?

ikarienator commented 9 years ago

Not yet...

A bit of a side track, but I think that in general running the standalone tokenizer should be slower than running the parser. In theory, there is no way to tokenize JavaScript correctly without parsing it: it requires a pushdown automaton (equivalent to a parser of a context-free language) rather than a finite-state machine (equivalent to a regular expression). This is especially true for ES6, where we have template literals. We use several heuristics (an algorithm similar to sweet.js) in the standalone tokenizer to guess the next lexical goal, but they are not 100% reliable. You are not supposed to use it in production.
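
A classic illustration of the ambiguity (my own example, not from the thread): a leading / may start either a division operator or a regular expression literal, and only the grammatical context decides which:

var foo = 1, g = 2, x;

{} /foo/g          // an empty block statement, then the regex literal /foo/g

x = {} /foo/g;     // an object literal divided by foo, then divided by g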

aslushnikov commented 9 years ago

@ikarienator thank you for the response. Can we have a tokenization callback then?

ikarienator commented 9 years ago

@aslushnikov What is your use case for the standalone tokenizer? Why can't you parse the tree and then drop the tree information? In fact, I think esprima.tokenize should do exactly that. What do you think @ariya?

ariya commented 9 years ago

A pure tokenizer is still a valid use case, particularly since it can consume partially valid JavaScript. It is useful for things like a syntax highlighter or even an autocompleter (e.g. Aulx).

ariya commented 9 years ago

@aslushnikov We should try to understand the performance problem before adding a new API function. Help is welcome in this matter!

I just ran a quick test using the V8 profiler while tokenizing the AngularJS code. The sampling profile looks like this:

 [JavaScript]:
   ticks  total  nonlib   name
    702    9.1%   23.0%  LazyCompile: *collectToken esprima.js:1424:26
    344    4.4%   11.3%  LazyCompile: *skipMultiLineComment esprima.js:412:34
    286    3.7%    9.4%  LazyCompile: *scanIdentifier esprima.js:643:28
    283    3.7%    9.3%  LazyCompile: *filterTokenLocation esprima.js:4101:33
    207    2.7%    6.8%  LazyCompile: *scanPunctuator esprima.js:678:28
    198    2.6%    6.5%  LazyCompile: *skipComment esprima.js:472:25
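
For anyone reproducing this: a profile like the one above comes from V8's built-in sampling profiler. On a recent Node.js the steps look roughly like this (Node 0.12 instead wrote a v8.log that had to be post-processed with the tick-processor scripts shipped in the V8 source tree):

node --prof test/benchmarks.js
node --prof-process isolate-*.log   # the built-in tick processor of later Node.js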

ikarienator commented 9 years ago

Can you try a minified version?

ariya commented 9 years ago

Minified Esprima or minified AngularJS?

ikarienator commented 9 years ago

Minified AngularJS.

aslushnikov commented 9 years ago

I apologize for being away for some time.

Why can't you parse the tree and drop the tree information?

@ikarienator the AST lacks semicolons. I would also need to know the exact tree structure to walk it, plus a mapping between node types and JavaScript tokens.

I just ran a quick test using V8 profiler when tokenizing AngularJS code.

@ariya AFAIK array reallocations are not represented in v8.log. Moreover, test/3rdparty/angular-1.2.5.js is only 53079 tokens long; the destructive effect of array reallocations is not noticeable on arrays of this size.

Performance audit

esprima.tokenize running time correlates strongly with the number of JavaScript tokens in the input file (and AFAIU the algorithm is O(N), where N is the size of the input). The concern is that esprima.tokenize performance degrades non-linearly as the input gets significantly larger. The following table additionally shows the number of JavaScript tokens in the last column.

lushnikov:~/prog/esprima/test(master)$ node benchmarks.js 
Doesn't have exposed gc().
    Underscore 1.5.2    42.5 KiB     12.71 ms ± 10.20%     7493 tok
      Backbone 1.1.0    58.7 KiB     13.00 ms ± 10.07%     8635 tok
          YUI 3.12.0   330.4 KiB     45.07 ms ± 11.39%    34996 tok
      MooTools 1.4.5   156.7 KiB     53.96 ms ± 9.80%     38672 tok
        jQuery 1.9.1   262.1 KiB     68.32 ms ± 15.26%    46112 tok
       Angular 1.2.5   701.7 KiB     74.61 ms ± 12.35%    53079 tok
 jQuery.Mobile 1.4.2   442.2 KiB     97.53 ms ± 15.63%    73641 tok
                 big  4690.4 KiB   1735.33 ms ± 11.79%   578335 tok

The big.js input is about 7 times larger than jQuery.Mobile, yet takes 17 times longer.

1. Data set

I would insist on using the big.js file for profiling purposes. It is 578335 tokens long and effectively reveals the problem (which is hardly observable on smaller data sets):

cd test/3rdparty
# generate big.js
cat underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js underscore-1.5.2.js backbone-1.1.0.js mootools-1.4.5.js jquery-1.9.1.js yui-3.12.0.js jquery.mobile-1.4.2.js angular-1.2.5.js angular-1.2.5.js > big.js

2. Benchmark

Now let's modify the benchmark to run tokenization solely on big.js. Apply the following patch to test/benchmarks.js:

--- a/test/benchmarks.js
+++ b/test/benchmarks.js
@@ -31,13 +31,7 @@ var setupBenchmarks,
     quickFixture;

 fullFixture = [
-    'Underscore 1.5.2',
-    'Backbone 1.1.0',
-    'MooTools 1.4.5',
-    'jQuery 1.9.1',
-    'YUI 3.12.0',
-    'jQuery.Mobile 1.4.2',
-    'Angular 1.2.5'
+    'big'
 ];

 quickFixture = [
@@ -322,8 +316,8 @@ if (typeof window !== 'undefined') {
                     size = source.length;
                 totalSize += size;
                 return suite.add(filename, function () {
-                    var syntax = esprima.parse(source, { range: true, loc: true });
-                    tree.push(syntax.body.length);
+                    var syntax = esprima.tokenize(source, { range: true, loc: true });
+                    tree.push(syntax.length);
                 }, {
                     'onComplete': function (event, bench) {
                         if (typeof gc === 'function') {

3. Baseline

Let's run tip-of-tree Esprima on the benchmark to establish its baseline performance.

lushnikov:~/prog/esprima(master)$ git log --oneline | head -1
bf5c615 fixes #1106: do not accept invalid/incomplete string escape sequences
lushnikov:~/prog/esprima(master)$ node test/benchmarks.js 
Doesn't have exposed gc().
                 big  4690.4 KiB   1814.81 ms ± 20.40%
                     ------------------------
                      4690.4 KiB   1814.81 ms ± 20.40%

Result: 1814ms.

4. Hypothesis

I will verify the hypothesis that the major overhead happens somewhere inside the extra.tokens.push calls (extra.tokens is the array that holds all produced tokens), presumably due to the repeated array resizing that happens as the array grows.
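
To make the suspected effect observable in isolation, here is a standalone micro-benchmark (my own sketch, independent of Esprima) that compares accumulating millions of small token-like objects in one growing array against streaming each object to a callback:

var N = 2e6;
var total = 0;

function makeToken(i) {
    return { type: 'Identifier', value: 'v' + (i % 100), range: [i, i + 1] };
}

function sink(token) {
    // Cheap per-token work standing in for a client callback.
    total += token.range[1] - token.range[0];
}

// Variant 1: keep every token alive in one ever-growing array.
console.time('array push');
var tokens = [];
for (var i = 0; i < N; i++) {
    tokens.push(makeToken(i));
}
console.timeEnd('array push');

// Variant 2: stream each token to a callback; nothing accumulates, so
// every token is immediately collectable and no array ever resizes.
console.time('callback');
for (var j = 0; j < N; j++) {
    sink(makeToken(j));
}
console.timeEnd('callback');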

5. Analysis

I will record v8.log for future reference, although array pushes will not show up in it.

lushnikov:~/prog/esprima(master)$ time node --prof test/benchmarks.js 
Doesn't have exposed gc().
                 big  4690.4 KiB   1996.43 ms ± 19.37%
                     ------------------------
                      4690.4 KiB   1996.43 ms ± 19.37%

real    0m23.590s
user    0m22.868s
sys     0m2.367s

The V8 log shows that the heaviest function is collectToken, followed by filterTokenLocation. These functions do a similar thing: each iteratively pushes 538000 objects into an array.

[Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
  16800   76.4%  .../lushnikov/bin/node
   6202   36.9%    LazyCompile: ~collectToken .../lushnikov/prog/esprima/esprima.js:1427:26
   6199  100.0%      LazyCompile: *lex .../lushnikov/prog/esprima/esprima.js:1463:17
   6199  100.0%        LazyCompile: tokenize .../lushnikov/prog/esprima/esprima.js:4330:22
   6199  100.0%          LazyCompile: ~<anonymous> .../lushnikov/prog/esprima/test/benchmarks.js:318:53
   6199  100.0%            Function: ~<anonymous> :1:10
   3075   18.3%    LazyCompile: ~filterTokenLocation .../lushnikov/prog/esprima/esprima.js:4303:33
   3073   99.9%      LazyCompile: tokenize .../lushnikov/prog/esprima/esprima.js:4330:22
   3073  100.0%        LazyCompile: ~<anonymous> .../lushnikov/prog/esprima/test/benchmarks.js:318:53
   3073  100.0%          Function: ~<anonymous> :1:10
   3073  100.0%            LazyCompile: clock .../lushnikov/prog/esprima/test/3rdparty/benchmark.js:2440:21

To indirectly prove that the performance offender is the array and its large size, let's clamp extra.tokens to 1000 entries. This is safe as long as at least a few recent tokens remain in the array, since the Esprima tokenization code relies on the last few parsed tokens in its logic. Apply the following diff to esprima.js:

diff --git a/esprima.js b/esprima.js
index aebf398..81c6656 100644
--- a/esprima.js
+++ b/esprima.js
@@ -1294,6 +1294,8 @@
                 range: [pos, index],
                 loc: loc
             });
+            if (extra.tokens.length > 1000)
+                extra.tokens = extra.tokens.slice(-10);
         }

         return regex;
@@ -1455,6 +1457,8 @@
                 };
             }
             extra.tokens.push(entry);
+            if (extra.tokens.length > 1000)
+                extra.tokens = extra.tokens.slice(-10);
         }

         return token;

Running the benchmark on the updated code reveals a 5x speedup: 322 ms in this version versus 1814 ms for tip-of-tree.

lushnikov:~/prog/esprima(master)$ node test/benchmarks.js 
Doesn't have exposed gc().
                 big  4690.4 KiB    322.08 ms ± 2.23%
                     ------------------------
                      4690.4 KiB    322.08 ms ± 2.23%

Conclusion

Given the above, the main performance offender seems to be the large array. One way to fix this would be to avoid building such a large array altogether and to report tokens via a callback (which is what my original pull request did).

ariya commented 9 years ago

@aslushnikov Thanks for the excellent and detailed analysis! I think we may want to pursue the path of providing delegates, since there is another use case for that as well; see #1113. Just like your original pull request, we would then implement a tokenizer delegate. There are some implications that we may need to take into account; I'll open a new issue soon to start the discussion.

ariya commented 9 years ago

I believe this is now resolved with the introduction of token delegator support (#1332).
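
For readers arriving later: the delegate added in #1332 lets callers process each token as it is produced. A minimal usage sketch, hedged against the Esprima documentation of later releases (check the current docs for exact semantics):

var esprima = require('esprima');

// The third argument is invoked once per token; its return value is what
// ends up in the result array, so per-token work needs no second pass.
var values = esprima.tokenize('answer = 42', {}, function (token) {
    return token.value;
});
console.log(values); // e.g. [ 'answer', '=', '42' ]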