mathjax / MathJax-node

MathJax for Node
Apache License 2.0

Request for tips on getting fastest performance from MathJax-node #246

Closed michael-okeefe closed 8 years ago

michael-okeefe commented 8 years ago

I wanted to start by thanking the makers of MathJax and MathJax-node. These are incredibly useful pieces of software, and they have really helped us share some sophisticated mathematical documents on the web!

I've been using MathJax-node to pre-process a large number of equations for a set of online documents (see here -- at the time I write this, the pre-rendered equations have not been made live, but you can see a preview here). Version 8.5 of the document called "Engineering Reference" is especially heavy on equations if you want to see what MathJax is doing for us. We decided to explore pre-rendering to reduce the large page-processing times and jumping for the current docs.

I'm largely satisfied with the pre-rendering quality (we tried both SVG and HTML/CSS and decided to go with HTML/CSS because the rendered text can be selected). However, I was disappointed at how long the pre-rendering process takes, although this is obviously a big job.

It's currently taking over 3.0 hours for the equation pre-rendering portion. This is a job that runs over 3,282 HTML files which contain a total of 26,358 inline math equations (typically fairly trivial, like adding a superscript \(^2\) or subscript \(_{zone}\)) and 16,837 display math equations which are typically fairly involved (see for example, this section). If I did my math correctly, that's about 240 equations per minute. This includes other overhead for extracting the equations from the html and reinserting the processed output (using Ruby/Nokogiri) but the equation processing takes the lion's share of the processing time.
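The throughput figure can be checked with a few lines of arithmetic. This is only a sketch using the counts quoted above; the variable names are illustrative, and since the run took "over" 3.0 hours, the real rate is somewhat below the computed one.

```javascript
// Counts and duration as quoted in the post above.
var inlineCount = 26358;   // inline math equations
var displayCount = 16837;  // display math equations
var minutes = 3.0 * 60;    // lower bound on the run time, in minutes

var total = inlineCount + displayCount;
var perMinute = total / minutes;

// Prints "43195 equations, about 240 per minute"
console.log(total + ' equations, about ' + Math.round(perMinute) + ' per minute');
```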

My question: is 240 equations per minute in the ballpark for expected performance for MathJax-node rendering (considering a mix of some trivial and many complex equations)? Do you have any tips for increasing the throughput-performance? Are there any special flags or settings I could be using to help performance here?

Since we don't have to do this that often, I can live with the processing time, but it is painful when we have to do a full build.

I've included the node.js code I'm using to call MathJax-node. Externally, I'm using Ruby and Nokogiri to parse the HTML and feed all of the equations into the node.js program via a file named by the "in_file" variable. I grab all of the inline equations and process them and then feed in all of the display math equations. The equations are passed in via the file "in_file" which is just a flat text file with all of the equations separated by newlines. In this instance, I'm using the HTML/CSS processor and I collect all of the HTML into a single file and all of the CSS into another file and then use Ruby to reinsert the html and css back into the HTML pages:

var mjAPI = require('mathjax-node/lib/mj-single.js');
var fs = require('fs');
mjAPI.config({
  MathJax: {
    // traditional MathJax configuration
  }
});
mjAPI.start();

function remove_delimiters(content) {
  return content
    .trim()
    .replace(/\r\n?|\n/, '')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&')
    .replace(/\\\*/g, '*')
    .replace(/^\\\(/, '')
    .replace(/\\\)$/, '')
    .replace(/^\\\[/, '')
    .replace(/\\\]$/, '');
}

var inline = process.argv[2] === "true";
var format = '';
if (inline) {
  format = "inline-TeX";
} else {
  format = "TeX";
}
var in_file = process.argv[3];
var out_html = process.argv[4];
var out_css = process.argv[5];
console.log('in_file: ' + in_file);
console.log('out_html: ' + out_html);
console.log('out_css: ' + out_css);

var array = fs.readFileSync(in_file).toString().trim().split("\n");
var lastIdx = array.length - 1;
console.log("lastIdx: " + lastIdx);
var htmls = [];
var css = '';
for(var i=0; i<array.length; i++) {
  console.log("eq " + i + " " + array[i]);
  var f = function(idx) {
    return function(data) {
      if (!data.errors) {
        css += data.css;
        htmls.push(data.html);
      }
      if (idx === lastIdx) {
        fs.writeFileSync(out_html, htmls.join("\n\n"));
        fs.writeFileSync(out_css, css);
        console.log("Done!");
      }
    };
  };
  mjAPI.typeset({
    math: remove_delimiters(array[i]),
    format: format, // "TeX", "inline-TeX", "MathML"
    html:true,
    css:true
  }, f(i));
}

dpvc commented 8 years ago

Thanks for your kind comments about MathJax and mathjax-node. We are very glad that you have been able to make good use of them, and your pages look very nice.

As for the performance issues with mathjax-node, the jsdom library that underlies mathjax-node is one of the bottlenecks in the process, I'm afraid. Your performance is consistent with my own testing (using the 23 display equations from the page you link to above, repeated multiple times).

There are, however, some things you can do to help a bit. First, the CSS is the same for all HTML expressions, so you only need to compute it once, and then just use that. There is no need to copy it once for each expression. You might also create the htmls array with the proper initial size (using Array(array.length)) so as not to have to extend it as the expressions are processed (though that is a minor issue).

My main concern, however, is that you are not treating the asynchronous processing done by MathJax in the best way. Here's what happens: when you do the mjAPI.start(), MathJax begins running, but has to stop to wait for some files to load (the input and output jax, etc.), so your program goes on while that happens. You read the expressions out of the file, and then call mjAPI.typeset() on the first one. Because MathJax is waiting for its files to load, it can't process the math at that point, and so it queues the request for later. Your program goes on to iterate the loop and call mjAPI.typeset() for the second equation. Again, MathJax must queue the request. Similarly for the third, fourth, and all the remaining expressions. So your entire loop runs for all 26,000 expressions, queuing them all, before your program gives up the CPU and lets MathJax continue with its suspended operations. At that point, the files can be loaded and MathJax can continue processing. Once it has loaded all the files it needs, it can go on and process your first expression, and it calls your callback when it is done. When your callback returns, it goes on to the second, third, and remaining equations in the same fashion, until the queue of expressions is empty.
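The queuing behavior described above can be illustrated with a small stand-in. This is a sketch only: createMockTypesetter, finishLoading, and peak are hypothetical names invented for the illustration, not part of the mathjax-node API, and the "loading" step is triggered by hand so the example runs synchronously.

```javascript
// Hypothetical stand-in for a typesetter that cannot work until its files
// have loaded. It is NOT mathjax-node; it only mimics the queuing behavior.
function createMockTypesetter() {
  var queue = [];
  var ready = false;
  var peakQueueLength = 0;

  function run(job) {
    job.callback({ html: '<span>' + job.math + '</span>' });
  }

  return {
    typeset: function (math, callback) {
      var job = { math: math, callback: callback };
      if (!ready) {
        // Files are still loading: remember the request for later.
        queue.push(job);
        peakQueueLength = Math.max(peakQueueLength, queue.length);
      } else {
        run(job);
      }
    },
    // Simulates the point where the program yields and the file loads finish.
    finishLoading: function () {
      ready = true;
      while (queue.length > 0) {
        run(queue.shift());
      }
    },
    peak: function () { return peakQueueLength; }
  };
}

var equations = ['x^2', 'y_{zone}', 'a+b', 'c/d'];
var typesetter = createMockTypesetter();
var results = [];
var completedDuringLoop = 0;

// Same shape as the for loop in the original program.
for (var k = 0; k < equations.length; k++) {
  typesetter.typeset(equations[k], function (data) { results.push(data.html); });
  completedDuringLoop = results.length;
}

typesetter.finishLoading();

console.log('completed during loop:', completedDuringLoop);  // 0
console.log('peak queue length:', typesetter.peak());        // 4
console.log('completed after loading:', results.length);     // 4
```

Nothing completes while the loop is running, and the queue grows to hold every expression before the first one is processed.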

There is memory overhead involved in saving the queue (with all its callbacks and expressions), and the garbage collection involved in cleaning it up. It's possible the process grows so large that paging for virtual memory becomes an issue (I didn't look into that). Personally, I prefer not to start the next typeset operation until the previous one completes. That means you have to think about the loop in a different way (using only callbacks, not a for loop). Here is an example:

var mjAPI = require('./lib/mj-single.js');
var fs = require('fs');
mjAPI.config({
  MathJax: {
    // traditional MathJax configuration
  }
});
mjAPI.start();

function remove_delimiters(content) {
  return content
    .trim()
    .replace(/\r\n?|\n/, '')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&')
    .replace(/\\\*/g, '*')
    .replace(/^\\\(/, '')
    .replace(/\\\)$/, '')
    .replace(/^\\\[/, '')
    .replace(/\\\]$/, '');
}

var inline = process.argv[2] === "true";
var format = '';
if (inline) {
  format = "inline-TeX";
} else {
  format = "TeX";
}
var in_file = process.argv[3];
var out_html = process.argv[4];
var out_css = process.argv[5];

var array = fs.readFileSync(in_file).toString().trim().split("\n");
var htmls = Array(array.length);
var i;

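// Store the output for expression i, then start the next one.
function saveAndContinue(data) {
  htmls[i] = data.html;
  processMath(i+1);
}

// Typeset expression n, or write the HTML file once the array is exhausted.
function processMath(n) {
  i = n; var math = array[i];
  if (math != null) {
    mjAPI.typeset({
      math: remove_delimiters(math), // strip \( \), \[ \] and HTML entities first
      format: format,
      html: true
    }, saveAndContinue);
  } else {
    fs.writeFileSync(out_html, htmls.join("\n\n"));
  }
}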

mjAPI.typeset({
  math: "",
  html: true,
  css: true
}, function (data) {
  fs.writeFileSync(out_css, data.css);
  processMath(0);
});

Here, we have a function processMath(n) that processes the n-th math expression, using the saveAndContinue() function as its callback. This saves the output and calls processMath() to do the next expression. So the next typeset only occurs when the previous one is done, and you don't have the issue of queuing all 26,000 expressions before any are processed. When we fall off the end of the array (math == null in that case), we write out the HTML file.

At the very bottom, we have to start the whole process. We also need to get the CSS (as I mentioned, we only need to do that once). We do this by typesetting an empty expression and collecting the CSS from it, saving it in the CSS output file, and calling processMath(0) to get the first expression started.

Another advantage of this is that you don't have to create several functions during each iteration of the main loop (your closure to get the idx variable passed to the callback). We use a global variable to avoid creating new functions at each step.

So there you have it. This performs marginally better, but it is not going to make that big a difference. I'm afraid the process is just slower than we would like. The jsdom that mathjax-node uses, and the CSSStyleDeclaration library that underlies it, are a major part of the slowdown. (The SVG output seems to be about 40% faster in mathjax-node, and I suspect it is because it doesn't have to set very many style attributes like the CommonHTML output does.) Sorry I can't give you much to make things faster!

michael-okeefe commented 8 years ago

Thank you so much for your extensive and thoughtful answer! I am excited to try out these changes when I get to work next week. I'm very new to node.js and JavaScript so I appreciate your help in critiquing the code.

Kind regards,

dpvc commented 8 years ago

I will be interested to hear if there was any improvement in the performance from this. Good luck!

michael-okeefe commented 8 years ago

Thank you again for the advice on enhancing the performance of MathJax-node.

Just to close the loop, I rebuilt our website yesterday morning with the new changes in place. Unfortunately, due to the realities of work, I had several other processing chores running in the background during both my initial session (the one I reported) and yesterday's session, which may have thrown the timing off. It does look like I got a speedup for the overall process, but I hesitate to give a definitive number (x% faster) because I'm not sure I have a strong baseline to compare against. I keep build times in a log file, but given the processor loading and the recent change to HTML/CSS from SVG, I don't feel I have a clean timing comparison of just the equation processing to offer. If I have time to run something cleaner, I'll note it here.

@dpvc , thank you again for your kind answer and sharing your MathJax and node.js expertise!

dpvc commented 8 years ago

Thanks for the update. I do hope you get an improvement. I'm going to close the issue for now.

NoTalk-ly commented 4 years ago

@dpvc according to what you said above:

Once it has loaded all the files it needs, it can go on and process your first expression, and it calls your callback when it is done. When your callback returns, it goes on to the second, third, and remaining equations in the same fashion, until the queue of expressions is empty.

Does that mean there is a typeset request queue in the processor, and that the processor handles only one request at a time? How can I increase concurrency other than by forking many processes to run the mathjax-node typeset method?

dpvc commented 4 years ago

@NoTalk-ly, you ask

there is a typeset request queue in the processor, and the processor handles only one request at a time? How can I increase concurrency other than by forking many processes to run the mathjax-node typeset method?

Since JavaScript is single-threaded, it can only perform one typesetting action at a time. This is inherent in the language itself. The usual solution is to fork processes, as you suggest.

There is a relatively new multi-threaded approach available in node since version 10.5.0 that may provide a lighter-weight solution. See this blog post about node.js multi-threading for a good introduction to node's concurrency issues. It discusses this as a new experimental feature in node, but I think it is now a core feature. See the node worker thread documentation for more details.
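As a rough sketch of the worker-thread approach, the expressions can be split into slices and each slice handed to its own worker. The names partition, renderSlice, and renderAll are hypothetical, and the "rendering" inside the worker is a placeholder that just wraps each string in a span; a real worker would need to load and configure its own typesetter instance. It assumes a Node version in which worker_threads is available without a flag.

```javascript
var workerThreads = require('worker_threads');

// Split the items into contiguous slices of roughly equal size.
function partition(items, parts) {
  var size = Math.ceil(items.length / parts);
  var slices = [];
  for (var start = 0; start < items.length; start += size) {
    slices.push(items.slice(start, start + size));
  }
  return slices;
}

// Worker body kept as a string so the sketch fits in one file.
// The mapping below is a placeholder for real typesetting.
var workerSource = [
  "var wt = require('worker_threads');",
  "var out = wt.workerData.map(function (m) { return '<span>' + m + '</span>'; });",
  "wt.parentPort.postMessage(out);"
].join('\n');

// Run one slice in its own worker and resolve with that slice's output.
function renderSlice(slice) {
  return new Promise(function (resolve, reject) {
    var worker = new workerThreads.Worker(workerSource, {
      eval: true,
      workerData: slice
    });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// Render all slices in parallel and flatten the results in input order.
function renderAll(equations, workerCount) {
  return Promise.all(partition(equations, workerCount).map(renderSlice))
    .then(function (chunks) { return [].concat.apply([], chunks); });
}

renderAll(['x^2', 'y_{zone}', 'a+b', 'c/d', 'e'], 2).then(function (htmls) {
  console.log(htmls.length + ' expressions rendered');
});
```

Each worker has its own JavaScript heap, so any per-worker start-up cost (loading the typesetter) is paid once per worker rather than once per expression; whether that pays off depends on how many expressions each worker handles.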

Although mathjax-node allows you to run MathJax from the command-line, it doesn't integrate well into node applications. Version 3 of MathJax provides much better node integration than version 2, and so you may want to consider using that instead of mathjax-node. See the MathJax node demos for examples of how this can be done in several different ways. You should be able to combine this with worker threads.