hexojs / hexo

A fast, simple & powerful blog framework, powered by Node.js.
https://hexo.io
MIT License

The proposal of using Multi Threads #4355

Open SukkaW opened 4 years ago

SukkaW commented 4 years ago

Since #550, the original creator of Hexo, @tommy351, has wanted to speed up Hexo with multi-core rendering. However, #550 was never continued due to the difficulty of managing multiple Hexo instances.

Recently I used Node.js worker_threads in a project (https://github.com/OI-wiki/OI-wiki/pull/2288) and learned something about them. Now that Node.js supports worker_threads, it is possible to bring up multi-core rendering for Hexo again.

Limit

Worker threads are designed to run CPU-intensive tasks that follow a simple algorithm:

Independent Input => Workers Calculating => Independent Output

Thus we cannot run many complex functions inside workers.

Design

As creating and destroying workers is still expensive (worker_threads have to communicate with the main thread), we should only create a limited number of worker_threads (in https://github.com/OI-wiki/OI-wiki/pull/2288 I use the number of CPU threads). Thus, a WorkerPool util should be made.

The WorkerPool is designed to queue tasks, manage them, and make sure the next task runs in an idle worker, so it needs methods for exactly that.

Here is an example of how to use WorkerPool, calling its run and destroy methods:

// index.js
const { join } = require('path');
const { WorkerPool } = require('hexo-util');

const workerPath = join(__dirname, 'some_worker.js');
const cpuNums = require('os').cpus().length;

const pool = new WorkerPool(workerPath, cpuNums);

const tasksList = [/* some stuff goes here ... */];
const result = {};

Promise.all(tasksList.map(async (task, taskId) => {
  const output = await pool.run(task);

  // do something with output, maybe writeFile or push to a resultArray.
  // taskId is the task's index in tasksList.
  result[taskId] = output;
})).then(() => {
  pool.destroy();

  // do something with result object.
});

// some_worker.js
const { isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
  throw new Error('It is not a worker, it seems like a Main Thread');
}

async function job(input) {
  // some stuff...
  return output;
}

parentPort.on('message', async input => {
  const output = await job(input);
  parentPort.postMessage(output);
});

As you can see, the example I gave is suitable for some filters (like meta_generator, backtick_code_filter) where we pass input to the filter and get output from it. But for more complicated jobs (like post rendering & template rendering), worker_threads still can't help.
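To make the design above more concrete, here is a minimal sketch of what such a WorkerPool could look like. It is only an illustration built on the `Worker` class from worker_threads, not the implementation proposed for hexo-util. The internal names (`queue`, `idle`, `_next`) and the optional third `workerOptions` argument are assumptions added for this sketch.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical sketch: a fixed-size pool that runs one task per worker at a time.
class WorkerPool {
  constructor(workerPath, size, workerOptions = {}) {
    this.queue = [];   // tasks waiting for an idle worker
    this.idle = [];    // workers with nothing to do
    this.workers = []; // every worker, kept so destroy() can terminate them
    for (let i = 0; i < size; i++) {
      const worker = new Worker(workerPath, workerOptions);
      this.workers.push(worker);
      this.idle.push(worker);
    }
  }

  // Queue a task; the promise settles with the worker's reply or its error.
  run(input) {
    return new Promise((resolve, reject) => {
      this.queue.push({ input, resolve, reject });
      this._next();
    });
  }

  // Hand the oldest queued task to an idle worker, if both exist.
  _next() {
    if (this.queue.length === 0 || this.idle.length === 0) return;
    const worker = this.idle.pop();
    const { input, resolve, reject } = this.queue.shift();

    const cleanup = () => {
      worker.off('message', onMessage);
      worker.off('error', onError);
    };
    const onMessage = output => {
      cleanup();
      this.idle.push(worker); // the worker is free again
      resolve(output);
      this._next();
    };
    const onError = err => {
      cleanup();
      // A worker that threw has exited, so it is not returned to the idle list.
      reject(err);
    };

    worker.on('message', onMessage);
    worker.on('error', onError);
    worker.postMessage(input);
  }

  // Terminate every worker; resolves once all of them have stopped.
  destroy() {
    return Promise.all(this.workers.map(worker => worker.terminate()));
  }
}
```

The sketch does not replace workers that crash, so a real util would need to handle that, along with tasks still queued when destroy() is called.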

cc @hexojs/core @tommy351

SukkaW commented 4 years ago

cc @hexojs/core @curbengh @stevenjoezhang @jiangtj @segayuu @YoshinoriN @JLHwung

Should we update the minimum required Node.js version to 12? Although Hexo 5.0.0 might not require such a high Node.js version, we could bring in more features during Hexo 5.x development.

curbengh commented 4 years ago

Should we update the minimum required Node.js version to 12?

As you can see, the example I gave is suitable for some filters

I'm ok with bumping to Node 12, as long as only filters are affected, to minimize the delay of 5.0.0. Perhaps only change 1-2 filters for now; the other filters can be updated during 5.x.

SukkaW commented 4 years ago

@curbengh We could even release 5.0.0 first, then add multi core support from 5.1.0.

curbengh commented 4 years ago

We could even release 5.0.0 first

It would be better to have at least one filter that utilizes this API to justify bumping to Node 12 (and demonstrate the benefit of that bump) in 5.0.0.

SukkaW commented 4 years ago

@curbengh

We could start with backtick_code filter.

Take a look at the flamegraph: https://29e28e2d8f6f8fdb247ad2c47788857d003fd894-12-hexo.surge.sh/flamegraph.html

It seems to be a long task.

tuananh commented 4 years ago

This is nice. I have had a very good experience with piscina. It's a nice wrapper (and more) around worker_threads.

https://github.com/piscinajs/piscina

SukkaW commented 4 years ago

@tuananh LGTM! It definitely seems better than my WorkerPool: https://github.com/hexojs/hexo-util/pull/212/

tuananh commented 4 years ago

I tried to optimize backtick_code with it but got a DataCloneError.

I haven't gotten around to fixing it yet. Not sure if it has anything to do with the way hexo calls all the filters:

return Promise.each(filters, filter => Reflect.apply(Promise.method(filter), ctx, args).then(result => {
  args[0] = result == null ? args[0] : result;
  return args[0];
})).then(() => args[0]);
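For readers less familiar with Bluebird's `Promise.each` and `Promise.method`, the same control flow can be restated with plain async/await. This is a sketch of the logic in the snippet above, not Hexo's actual code, and `execFilters` is a hypothetical name. Each filter runs in order with `ctx` as `this`, and its return value replaces `args[0]` unless it is null or undefined.

```javascript
// Hypothetical restatement of the filter loop using native promises.
async function execFilters(filters, ctx, args) {
  for (const filter of filters) {
    // Filters may be sync or async; await covers both, and a synchronous
    // throw becomes a rejection because we are inside an async function.
    const result = await Reflect.apply(filter, ctx, args);
    // A null or undefined result keeps the previous value.
    if (result != null) args[0] = result;
  }
  return args[0];
}
```

Note that every filter receives the full `ctx` as `this`, which is the part that matters when a filter is moved into a worker.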

SukkaW commented 4 years ago

I haven't gotten around to fixing it yet. Not sure if it has anything to do with the way hexo calls all the filters

@tuananh The entire hexo context simply cannot be passed to a worker. Only simple values (like strings, numbers, and plain objects) can be passed to a worker.

SukkaW commented 4 years ago

Here's what we can learn from #4368:

According to the documents of the worker_threads:

value will be transferred in a way which is compatible with the HTML structured clone algorithm.

Which means:

Function objects cannot be duplicated by the structured clone algorithm; attempting to throws a DATA_CLONE_ERR exception.

The structured clone algorithm also means that communicating with threads is expensive, just like creating & destroying one. We should keep the input and output pure and simple (containing only the required information) to make structured cloning faster.
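A quick way to see which values survive the trip is to post them through a `MessageChannel` from worker_threads, which uses the same cloning step as posting to a worker. The helper below is a hypothetical sketch (`canBePosted` is not part of Hexo or Node.js); it relies on `postMessage` throwing synchronously when a value cannot be cloned.

```javascript
const { MessageChannel } = require('worker_threads');

// Hypothetical helper: returns true if `value` can be sent with postMessage.
function canBePosted(value) {
  const { port1, port2 } = new MessageChannel();
  try {
    // Throws synchronously (a DataCloneError) if the value cannot be cloned.
    port1.postMessage(value);
    return true;
  } catch (err) {
    return false;
  } finally {
    // Close both ends so the channel does not linger.
    port1.close();
    port2.close();
  }
}

// Plain data is fine.
console.log(canBePosted({ title: 'Hello', tags: ['a', 'b'] })); // true
// Functions, or objects carrying functions, are not.
console.log(canBePosted(() => {}));        // false
console.log(canBePosted({ render() {} })); // false
```

This is why a filter moved into a worker should receive only the few plain fields it needs rather than an object that carries methods.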

tuananh commented 4 years ago

@SukkaW That's probably it. In order to change that, do we need to change the way we pass the hexo instance around?

SukkaW commented 4 years ago

Instead of worker_threads, I am considering using the cluster API.

The cluster API is much simpler, and has been stable since Node.js 4.0. It has none of the "structured clone algorithm" issues either.

The only problem is that cluster is designed to handle multiple HTTP requests. We have to find a way to adapt it to Hexo.

@curbengh @tuananh

stevenjoezhang commented 5 months ago

From the perspective of 2024, the support for multithreading in Node.js has not improved. The rendering process of posts heavily relies on Hexo's ctx, but without the ability to use shared memory, worker threads cannot directly access the global variables in Hexo.

SukkaW commented 5 months ago

From the perspective of 2024, the support for multithreading in Node.js has not improved. The rendering process of posts heavily relies on Hexo's ctx, but without the ability to use shared memory, worker threads cannot directly access the global variables in Hexo.

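So this basically leaves us with 2 options:

* Creating multiple Hexo instances in different worker threads. In every thread, we will read the config and posts.

* Offloading limited heavy tasks to the worker threads (markdown rendering? nunjucks rendering?) while retaining one main Hexo instance.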

tuananh commented 5 months ago

From the perspective of 2024, the support for multithreading in Node.js has not improved. The rendering process of posts heavily relies on Hexo's ctx, but without the ability to use shared memory, worker threads cannot directly access the global variables in Hexo.

So this basically leaves us with 2 options:

* Creating multiple Hexo instances in different worker threads. In every thread, we will read the config and posts.

* Offloading limited heavy tasks to the worker threads (markdown rendering? nunjucks rendering?) while retaining one main Hexo instance.

Option 2 sounds better to me.
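As a rough illustration of option 2, the sketch below keeps all post objects on the main thread and sends only `{ id, raw }` pairs to a single worker, then merges the rendered strings back by id. Everything here is an assumption made for the example: `renderPosts` and the `id`/`raw` fields are hypothetical, the worker source is inlined with `eval: true` only to keep the sketch self-contained, and the "renderer" is a stand-in that wraps text in a paragraph tag rather than a real markdown engine.

```javascript
const { Worker } = require('worker_threads');

// Stand-in worker: receives { id, raw } and replies with { id, html }.
const workerSource = `
  const { parentPort } = require('worker_threads');
  parentPort.on('message', ({ id, raw }) => {
    parentPort.postMessage({ id, html: '<p>' + raw.trim() + '</p>' });
  });
`;

// Hypothetical helper: renders posts in a worker and returns a Map of id -> html.
// Assumes every post has a unique id.
async function renderPosts(posts) {
  const rendered = new Map();
  if (posts.length === 0) return rendered;

  const worker = new Worker(workerSource, { eval: true });
  try {
    await new Promise((resolve, reject) => {
      worker.on('error', reject);
      worker.on('message', ({ id, html }) => {
        rendered.set(id, html);
        if (rendered.size === posts.length) resolve();
      });
      // Only plain strings cross the thread boundary; the post objects,
      // which may carry methods, stay on the main thread.
      for (const { id, raw } of posts) worker.postMessage({ id, raw });
    });
  } finally {
    await worker.terminate();
  }
  return rendered;
}
```

The main instance stays the single owner of the posts, and the worker never needs to see anything beyond the text it renders.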