
VM: vm.runInNewContext with timeout got perf impact #52261

Open Lellansin opened 6 months ago

Lellansin commented 6 months ago

Version

v20.10.0

Platform

Linux one-87-c49d8fd65-xg6j9 5.4.247-162.350.amzn2.x86_64 #1 SMP Tue Jun 27 22:03:59 UTC 2023 x86_64 GNU/Linux

Subsystem

vm

What steps will reproduce the bug?

We encountered a large performance difference in one particular scenario: running a vm script with the timeout option, like:

const { runInContext, createContext } = require('vm');
const http = require('http');
const context = createContext({});
const server = http.createServer(async (req, res) => {
  await runInContext(
    ``, // empty code
    context, // empty context
    {
      timeout: 300, // but with default timeout
    });
  res.write('Hello, World!\n');
  res.end();
});
server.listen(8080, () => {
  console.log('Server is running on http://localhost:8080');
});

Then run an Apache ab test: ab -n1000 -c100 http://localhost:8080/.

The programs run in pods of two different k8s clusters, so I also made an image, which can be built with a Dockerfile like:

# FROM amd64/node:21.7.1
# FROM amd64/node:16.16.0
FROM amd64/node:20.10.0
# FROM amd64/node:20.11.0
# FROM node:20.10.0

RUN apt-get update && apt-get install -y apache2-utils

WORKDIR /root/home

COPY . .

# generate application version
RUN TZ=Asia/Shanghai date +%Y_%m%d_%H%M > .version

# startup cmd
ENTRYPOINT [ "/bin/bash", "/root/home/entrypoint.sh" ]

The entrypoint.sh looks like:

#!/bin/bash

(sleep 3 && time curl localhost:8080/) &
(sleep 3 && echo "start apache ab benchmark") &
(sleep 4 && ab -n1000 -c100 localhost:8080/) &

node /root/home/test.js

If Docker is installed, you can test it directly with this command: docker run -it lellansin/test-vm-scripts:0.1.

We got the following (benchmark screenshot omitted): the problem machine is dramatically slower than the normal machine.

How often does it reproduce? Is there a required condition?

It now reproduces consistently on the problem machine.

What is the expected behavior? Why is that the expected behavior?

The performance of the normal machines and the problem machines should not differ this much.

What do you see instead?

During my troubleshooting, I noticed that this Tuesday (2024-03-26) Node.js released v20.12.0, which included "vm: fix V8 compilation cache support for vm.Script".

Then I tested upgrading from v20.10.0 to v20.12.0, and the new version did fix this problem. Still, hitting the V8 compilation cache seems more like a workaround than a final solution. Wanting to understand the root cause, I still want to confirm what specifically causes this problem.
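
For anyone stuck on an older version, a minimal sketch of a mitigation in the same spirit (my own workaround idea, not the fix that landed in v20.12.0): compile the vm.Script once at startup and reuse it per request, so compilation drops out of the per-request cost:

const vm = require('vm');

const context = vm.createContext({});
// Compile once; avoids recompiling the source on every request.
const script = new vm.Script(''); // same empty code as in the repro

// Per request: reuse the compiled script. Note the watchdog thread
// is still created per call when timeout is set; only compilation
// work is saved here.
script.runInContext(context, { timeout: 300 });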

According to this code, the difference between setting a timeout and not setting one is that setting it initializes a Watchdog object, whose constructor creates a new libuv event loop and a corresponding thread for every call.
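
To isolate that per-call cost, here is a minimal micro-benchmark sketch (my assumption is that the thread spawned per call dominates; absolute numbers will vary by machine):

const { runInContext, createContext } = require('vm');

const context = createContext({});
const N = 10000;

function bench(label, options) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < N; i++) {
    runInContext('', context, options); // same empty code as the repro
  }
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)} ms for ${N} runs`);
}

bench('no timeout ', {});
bench('timeout 300', { timeout: 300 }); // spawns a watchdog thread per call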

The CPU utilization on the problem machine stays below 10%. It looks like some resource is restricted, or the main thread of Node.js is waiting on something.

Through ps -eLf, I observed that the thread count of the Node.js process scales from 11 up to 28 on the normal machine, while on the problem machine it stays at 11.

To isolate this, I wrote the following libuv test code:

#include <iostream>
#include <pthread.h>
#include <uv.h>

using namespace std;

struct ThreadData {
    uv_loop_t* loop;
    uv_timer_t timer;
    bool timed_out;
};

// Fires once after 100 ms, mirroring the watchdog timeout path.
void timer_cb(uv_timer_t* handle) {
    ThreadData* data = static_cast<ThreadData*>(handle->data);
    data->timed_out = true;
    uv_stop(data->loop);
}

// Each thread creates its own event loop plus a one-shot timer,
// similar to what the Watchdog does per runInContext() call when
// a timeout is set.
void* threadFunc(void* arg) {
    ThreadData* data = static_cast<ThreadData*>(arg);

    data->loop = new uv_loop_t;
    uv_loop_init(data->loop);

    data->timed_out = false;
    uv_timer_init(data->loop, &data->timer);
    data->timer.data = data;

    uv_timer_start(&data->timer, timer_cb, 100, 0);

    uv_run(data->loop, UV_RUN_DEFAULT);

    // Close the timer handle and let its close callback run before
    // closing the loop; otherwise uv_loop_close() fails with UV_EBUSY.
    uv_close(reinterpret_cast<uv_handle_t*>(&data->timer), nullptr);
    uv_run(data->loop, UV_RUN_DEFAULT);

    uv_loop_close(data->loop);
    delete data->loop;

    return nullptr;
}

int main() {
    const int NUM_THREADS = 10;
    pthread_t threads[NUM_THREADS];
    ThreadData threadData[NUM_THREADS];

    // Spawn the threads...
    for (int i = 0; i < NUM_THREADS; ++i) {
        if (pthread_create(&threads[i], nullptr, threadFunc, &threadData[i])) {
            cerr << "Error creating thread " << i << endl;
            return 1;
        }
    }

    // ...and wait for all of them to finish.
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], nullptr);
    }

    // Count how many of the timers actually fired.
    int timed_out_count = 0;
    for (int i = 0; i < NUM_THREADS; ++i) {
        if (threadData[i].timed_out) {
            timed_out_count++;
        }
    }

    cout << "Timed out threads: " << timed_out_count << endl;

    return 0;
}
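
The test can be compiled with g++ test.cc -o uv-test -luv -lpthread (assuming the libuv development headers are installed) and run on both machines.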

It showed no significant difference between the normal machine and the problem machine.

Additional information

No response

Lellansin commented 6 months ago

Both k8s clusters mentioned above (the normal machines and the problem machines) are configured with CPU overcommit.