nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Node 20.3 Crashes all the time when executed inside docker #48444

Closed · ronag closed this 1 month ago

ronag commented 1 year ago

Another possible ref https://github.com/electron/rebuild/pull/1085

ronag commented 1 year ago

@nodejs/libuv

bnoordhuis commented 1 year ago

"Text file busy" means trying to write a shared object or binary that's already in use.

My hunch is that node-gyp has some race condition in reading/writing files that wasn't manifesting (much) when everything still went through the much slower thread pool, whereas io_uring is fast enough to make it much more visible.

ronag commented 1 year ago

Is there a way to disable io_uring using an env variable as a temporary workaround when running node-gyp?

bnoordhuis commented 1 year ago

Yes, set UV_USE_IO_URING=0 in the environment. Use at your own risk: it's not a stable thing and will disappear again in the future.
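
One way to apply that inside a container is to pass the variable in at startup; the -e flag is standard docker run usage, and per the caveat above this is a temporary, unsupported knob, not a recommended setting:

# Temporary, unsupported workaround: disable io_uring for everything in the container
docker run -it -e UV_USE_IO_URING=0 node:20.3.0 bash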

ronag commented 1 year ago
test-srv3.hq.lan:~# docker run -it node:20.3.0 bash
root@6f6c66eb5077:/# UV_USE_IO_URING=0 yarn add bufferutil
node[8]: ../src/node_platform.cc:68:std::unique_ptr<long unsigned int> node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
 1: 0xc8e4a0 node::Abort() [node]
 2: 0xc8e51e  [node]
 3: 0xd0a059 node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
 4: 0xd0a17c node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
 5: 0xc4bbc4 node::V8Platform::Initialize(int) [node]
 6: 0xc49408  [node]
 7: 0xc497db node::Start(int, char**) [node]
 8: 0x7ff94ec5818a  [/lib/x86_64-linux-gnu/libc.so.6]
 9: 0x7ff94ec58245 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
10: 0xba9ade _start [node]
Aborted
ronag commented 1 year ago

This is actually worse than I thought. Node doesn't run at all with 20.3

ronag commented 1 year ago
test-srv3.hq.lan:~# docker run -it node:20.3.0 node
node[1]: ../src/node_platform.cc:68:std::unique_ptr<long unsigned int> node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
 1: 0xc8e4a0 node::Abort() [node]
 2: 0xc8e51e  [node]
 3: 0xd0a059 node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
 4: 0xd0a17c node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
 5: 0xc4bbc4 node::V8Platform::Initialize(int) [node]
 6: 0xc49408  [node]
 7: 0xc497db node::Start(int, char**) [node]
 8: 0x7f5c393be18a  [/lib/x86_64-linux-gnu/libc.so.6]
 9: 0x7f5c393be245 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
10: 0xba9ade _start [node]
test-srv3.hq.lan:~# docker run -it node:20.2.0 node
Welcome to Node.js v20.2.0.
Type ".help" for more information.
> 
mcollina commented 1 year ago

I can reproduce this, and this is quite critical.

mcollina commented 1 year ago

cc @nodejs/tsc for visibility

richardlau commented 1 year ago

FWIW on two systems I have access to (a Red Hat owned RHEL 8 machine and test-digitalocean-ubuntu1804-docker-x64-1 from the Build infra) docker run -it node:20.3.0 node is fine:

root@test-digitalocean-ubuntu1804-docker-x64-1:~# docker run -it node:20.3.0 node
Unable to find image 'node:20.3.0' locally
20.3.0: Pulling from library/node
bba7bb10d5ba: Pull complete
ec2b820b8e87: Pull complete
284f2345db05: Pull complete
fea23129f080: Pull complete
9063cd8e3106: Pull complete
4b4424ee38d8: Pull complete
0b4eb4cbb822: Pull complete
43443b026dcf: Pull complete
Digest: sha256:fc738db1cbb81214be1719436605e9d7d84746e5eaf0629762aeba114aa0c28d
Status: Downloaded newer image for node:20.3.0
Welcome to Node.js v20.3.0.
Type ".help" for more information.
>

I can reproduce the assertion failure on an Ubuntu 16.04 host with node:20.3.0 but not with node:20.3.0-bullseye:

root@infra-digitalocean-ubuntu1604-x64-1:~# docker run -it node:20.3.0 node
Unable to find image 'node:20.3.0' locally
20.3.0: Pulling from library/node
bba7bb10d5ba: Pull complete
ec2b820b8e87: Pull complete
284f2345db05: Pull complete
fea23129f080: Pull complete
9063cd8e3106: Pull complete
4b4424ee38d8: Pull complete
0b4eb4cbb822: Pull complete
43443b026dcf: Pull complete
Digest: sha256:fc738db1cbb81214be1719436605e9d7d84746e5eaf0629762aeba114aa0c28d
Status: Downloaded newer image for node:20.3.0
node[1]: ../src/node_platform.cc:68:std::unique_ptr<long unsigned int> node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
 1: 0xc8e4a0 node::Abort() [node]
 2: 0xc8e51e  [node]
 3: 0xd0a059 node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
 4: 0xd0a17c node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
 5: 0xc4bbc4 node::V8Platform::Initialize(int) [node]
 6: 0xc49408  [node]
 7: 0xc497db node::Start(int, char**) [node]
 8: 0x7f6e8486218a  [/lib/x86_64-linux-gnu/libc.so.6]
 9: 0x7f6e84862245 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
10: 0xba9ade _start [node]
root@infra-digitalocean-ubuntu1604-x64-1:~# docker run -it node:20.3.0-bullseye node
Unable to find image 'node:20.3.0-bullseye' locally
20.3.0-bullseye: Pulling from library/node
93c2d578e421: Already exists
c87e6f3487e1: Already exists
65b4d59f9aba: Already exists
d7edca23d42b: Already exists
25c206b29ffe: Already exists
599134452287: Pull complete
bd8a83c4c2aa: Pull complete
d11f4613ae42: Pull complete
Digest: sha256:ceb28814a32b676bf4f6607e036944adbdb6ba7005214134deb657500b26f0d0
Status: Downloaded newer image for node:20.3.0-bullseye
Welcome to Node.js v20.3.0.
Type ".help" for more information.
>

Our website build is actually broken when running apt update with the default Node.js LTS image based on Debian 12 (bookworm), so we've switched to the Debian 11 (bullseye) based image for now: https://github.com/nodejs/build/issues/3382

targos commented 1 year ago

FWIW I opened an issue about this in the docker-node repo: https://github.com/nodejs/docker-node/issues/1918

TLDR: this is not a problem with Node.js itself, but with the default base OS used by the Docker image, which was upgraded for v20.3.0.

ronag commented 1 year ago

bullseye works for me as well

ronag commented 1 year ago

Now I also get the file busy error:

test-srv3.hq.lan:~# docker run -it node:20.3.0-bullseye bash
root@ed020dd3f80e:/# yarn add bufferutil
yarn add v1.22.19
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
[4/4] Building fresh packages...
error /node_modules/bufferutil: Command failed.
Exit code: 126
Command: node-gyp-build
Arguments: 
Directory: /node_modules/bufferutil
Output:
/bin/sh: 1: node-gyp-build: Text file busy

EDIT: Works with UV_USE_IO_URING=0

ronag commented 1 year ago

So to summarize:

  1. the node:20.3.0 image (now based on Debian bookworm) doesn't start node at all on some hosts and aborts with the uv_thread_create assertion; the bullseye-based image is fine
  2. node-gyp builds fail with "Text file busy", and that error goes away with UV_USE_IO_URING=0

ronag commented 1 year ago

Should I split the io_uring problem into a separate issue?

bnoordhuis commented 1 year ago

Can someone post the result of strace -o trace.log -yy -f node app.js when it crashes with that uv_thread_create check? I expect to see a failing clone/clone2/clone3 system call but it'd be good to confirm.
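
A sketch of one way to capture that, assuming apt still works inside the container (the next comment notes it may not on affected hosts) and that ptrace is permitted; on older Docker that may mean adding --cap-add=SYS_PTRACE. app.js here is only a stand-in for whatever triggers the crash:

docker run --cap-add=SYS_PTRACE -it node:20.3.0 bash
apt-get update && apt-get install -y strace
strace -o trace.log -yy -f node app.js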

richardlau commented 1 year ago

On the Ubuntu 16.04 infra machine I cannot run apt in the bookworm-based node:20.3.0 or node:lts containers to install strace (it's not there by default).

Another data point: adding --security-opt=seccomp:unconfined makes this work on the Ubuntu 16.04 host:

root@infra-digitalocean-ubuntu1604-x64-1:~# docker run --security-opt=seccomp:unconfined -it node:20.3.0 node
Welcome to Node.js v20.3.0.
Type ".help" for more information.
>
bnoordhuis commented 1 year ago

Right, then I can predict with near 100% certainty what the problem is: docker doesn't know about the newish clone3 system call. Its seccomp filter rejects it with some bogus error and node consequently fails when it tries to start a new thread.

This docker seccomp thing is like clockwork; it always pops up when new system calls are starting to see broader use. It's quite possibly fixed in newer versions.
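
Two quick checks that follow from this explanation, assuming CLI access on an affected host; running without the seccomp filter mirrors richardlau's result above and is a diagnostic only, not a recommended setting:

# If the default seccomp filter is what rejects clone3, dropping it should let node start:
docker run --security-opt seccomp=unconfined -it node:20.3.0 node

# Worth recording in any report: the engine version on the affected host.
docker version --format '{{.Server.Version}}'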

mcollina commented 1 year ago

Updating docker to the latest version (v24.0.2) fixed it for me.

A few notes:

Here is what I think we should do:

This seems a future-proof solution while keeping the current functionality available.

bnoordhuis commented 1 year ago

document this error and the UV_USE_IO_URING=0 solution for v20

UV_USE_IO_URING is (intentionally) undocumented and going away again so don't do that.

mcollina commented 1 year ago

@bnoordhuis Would you just document this as "if you are hit by this bug, update docker"?

ronag commented 1 year ago

I think there are two different things here. I'm not sure updating docker will help with the io_uring problem. Or will it? Please confirm.

santigimeno commented 1 year ago

If I'm reading this correctly there are 2 separate issues here.

I think we should try to better understand the 2nd issue (the "Text file busy" / io_uring one) before disabling io_uring.

jkuchar commented 1 year ago

The same thing we are experiencing here: https://github.com/nodejs/docker-node/issues/1912#issuecomment-1594408113

mcollina commented 1 year ago

I cannot reproduce the bufferutil issue with the latest docker.

bnoordhuis commented 1 year ago

@bnoordhuis Would you just document this as "if you are hit by this bug, update docker"?

Yes.

mcollina commented 1 year ago

Tagging @nodejs/tsc for visibility.

@nodejs/docker are you comfortable with this approach?

ronag commented 1 year ago

In my opinion this will cause too much disturbance. Until enterprises have had more time to upgrade docker we should:

I don't see how we could have an LTS release with the current situation.

jkuchar commented 1 year ago

I have tested this on the latest release of Docker Desktop on Mac, and I have the same issue described here.

mhdawson commented 1 year ago

I agree with @ronag that the balance between wanting to enable better performance by default and not causing undue ecosystem pain suggests that either fixing the io_uring issues or disabling io_uring for Node.js 20 (while leaving it enabled in main) is the best way forward. Maybe we can detect if we are running under docker and only disable it in that case?
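
For what it's worth, a sketch of the kind of container-detection heuristic that suggestion implies; the checks below are common conventions rather than anything libuv or Node.js actually does, and they are not foolproof (cgroup v2 hosts may not mention docker at all):

# Heuristic: docker containers usually have /.dockerenv, and cgroup v1 paths
# for PID 1 usually mention "docker". If either matches, fall back to the
# thread pool via the (unsupported) environment variable discussed above.
if [ -f /.dockerenv ] || grep -q docker /proc/1/cgroup 2>/dev/null; then
  export UV_USE_IO_URING=0
fi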

trevnorris commented 1 year ago

Let's just be clear that this isn't a problem with io_uring. The proposal is to work around a Docker deficiency, and in the process lose non-trivial performance gains for all other supported systems.

santigimeno commented 1 year ago

I'm more than happy to take a look at the "Text file busy" issue but haven't been able to reproduce it: I've tried in different Linux VMs with different docker versions with no luck. Can anyone provide instructions to reproduce it on a non-obscure OS (all the better if it's Linux based)? Thanks!

nschonni commented 1 year ago

We've updated the default tags to the latest release in the past, like with bullseye, so I don't think we should change that behaviour this time. Although I'm one of the docker maintainers, it's not something I deal with every day, but here is the recent collision of a few issues that I think are being seen:

  1. io_uring issues are happening on other OSes, including the Alpine variants, with the latest v20
  2. bookworm requires a newer Docker version for some of this seccomp stuff, and in some cases that hasn't rolled out to all OSes equally. This isn't really any different from GitHub Actions having ubuntu-latest vs ubuntu-20.04 tags: unless you opt in to a specific version, there may be cases where stuff like this happens
  3. bookworm changed some core symlinks, which I believe I saw @yosifkit or @tianon mention on another issue thread. People who were broadly doing COPY . . operations were finally getting bitten by their bad config
  4. bookworm moved from OpenSSL 1 to 3, and some people had made assumptions about installed packages for OpenSSL 1, but that wasn't something we ever intentionally included with the images
ronag commented 1 year ago

Let's just be clear that this isn't a problem with io_uring. The proposal is to work around a Docker deficiency, and in the process lose non-trivial performance gains for all other supported systems.

IMHO, whose fault it is isn't the main issue. The problem is that it causes non-trivial disturbance for many users.

santigimeno commented 1 year ago

FWIW, I investigated it a bit and came up with a minimum test case for the Text file busy issue.

const { open, close } = require('node:fs');
const { spawn } = require('node:child_process');

const filePath = '/tmp/file';
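// Create an executable file, close it, then try to exec it. The exec can fail
// with ETXTBSY ("Text file busy") when the close is serviced through io_uring,
// matching the node-gyp-build error reported above.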

open(filePath, 'w', 0o755, (err, fd) => {
  if (err) throw err;
  close(fd, (err) => {
    if (err) throw err;
    // Wait a bit to be sure there's no race condition there
    setTimeout(() => {
      spawn(filePath);
    }, 1000);
  });
});

A couple of remarks:

santigimeno commented 1 year ago

Update: testing on Ubuntu 22.04 with an older kernel version (5.15.30) triggers the same ETXTBSY error. So it seems to me like the problem lies in the interaction between a kernel of version 5.15.78 or older (at least) and the bullseye node docker image.

bnoordhuis commented 1 year ago

Interesting findings, not sure what to make of it yet. Is it the close() or the execve() system call that returns ETXTBSY?

rubiesonthesky commented 1 year ago

I have had this problem with the CircleCI node image cimg/node:20.3.0, which is based on Ubuntu 22.04. It happens when trying to install dependencies that need node-gyp. I hope this info can help with testing the possible fix or pinpointing when it happens.

santigimeno commented 1 year ago

Interesting findings, not sure what to make of it yet. Is it the close() or the execve() system call that returns ETXTBSY?

It's the execve() call. That's why I said the issue seems to be related to IORING_OP_CLOSE, as issuing the close() directly doesn't exhibit the issue.
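
For anyone wanting to double-check that locally, a sketch assuming strace is installed and the test case above is saved as repro.js (a hypothetical filename):

# Follow children (-f) and trace only the calls of interest; the failing execve
# shows up with ETXTBSY in the output.
strace -f -e trace=execve,close,io_uring_enter node repro.js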

bnoordhuis commented 1 year ago

I checked what changed for IORING_OP_CLOSE in fs/io_uring.c between 5.13-5.15 (little) and 5.15-5.16 (nothing) but I don't see anything that explains such behavior. Does it reproduce outside docker?

ludomikula commented 1 year ago

From my experience, it only happens in docker. It builds OK outside.

santigimeno commented 1 year ago

That's correct. It doesn't happen when running the code in the host, so there must be some specific interaction with docker.

santigimeno commented 1 year ago

BTW, there must be some specific change between 5.15.78 and 5.15.98 that fixed the issue.

bnoordhuis commented 1 year ago

There were 34 io_uring fixes between those patch releases but none that stand out as obvious candidates for this particular bug. I don't like that the cause isn't really understood but there's only so much you can do. Workaround in https://github.com/libuv/libuv/pull/4059.

santigimeno commented 1 year ago

I've narrowed it down more: 5.15.85 has the error and 5.15.90 doesn't. 5.15.90 looks like a good candidate to have the fix as it has numerous io_uring-related commits, though none is obviously the fix to me: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.15.90

bnoordhuis commented 1 year ago

Santi, no doubt you enjoyed the v1.45.0 release immensely. Want to do v1.46.0 too?

santigimeno commented 1 year ago

Yes, I'll take care of it.

santigimeno commented 1 year ago

Though we should confirm first that the issue is actually fixed

ronag commented 1 year ago

Does this also solve the non-bullseye image crash?