vercel / next.js

The React Framework
https://nextjs.org

High number of processes of /next/dist/compiled/jest-worker/processChild.js still alive after next build #45508

Closed zqjimlove closed 1 year ago

zqjimlove commented 1 year ago

Verify canary release

Provide environment information

Operating System:
  Platform: darwin
  Arch: arm64
  Version: Darwin Kernel Version 22.3.0: Thu Jan  5 20:48:54 PST 2023; root:xnu-8792.81.2~2/RELEASE_ARM64_T6000
Binaries:
  Node: 18.13.0
  npm: 8.19.3
  Yarn: 1.22.19
  pnpm: 7.26.2
Relevant packages:
  next: 12.0.9
  react: 17.0.2
  react-dom: 17.0.2

Which area(s) of Next.js are affected? (leave empty if unsure)

CLI (create-next-app)

Link to the code that reproduces this issue

https://github.com/vercel/next.js/files/10565355/reproduce.zip

To Reproduce

reproduce.zip


This problem is reproducible on next@12.0.9 and above; 12.0.8 was fine.

Alternatively, on next@12.0.9 and above, removing getInitialProps from _app.tsx also avoids the problem:

// GlobalApp.getInitialProps = async function getInitialProps(appContext) {
//   const appProps = await App.getInitialProps(appContext);

//   return {
//     ...appProps,
//   };
// };
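
For reference, a minimal _app.tsx along these lines exercises the pattern above. This is a sketch reconstructed from the commented snippet, not the exact file from reproduce.zip:

// pages/_app.tsx -- minimal sketch
import App, { AppContext, AppProps } from 'next/app';

function GlobalApp({ Component, pageProps }: AppProps) {
  return <Component {...pageProps} />;
}

// Defining getInitialProps on the custom App disables Automatic Static
// Optimization for every page; removing it avoided the lingering
// processChild.js workers, as noted above.
GlobalApp.getInitialProps = async function getInitialProps(appContext: AppContext) {
  const appProps = await App.getInitialProps(appContext);
  return { ...appProps };
};

export default GlobalApp;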

Describe the Bug

High number of processes of /next/dist/compiled/jest-worker/processChild.js still alive after next build

Expected Behavior

Kill all child processes.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

No response

NEXT-1348

EpicIvo commented 1 year ago

Hosting on Platform.sh here, still using pages dir and downgrading from 13.4 to 13.2.4 seems to have solved the issue for now 👌🏽

ijjk commented 1 year ago

@cannontrodder is correct, and the reduced worker count can explain any noticed slowdown. Please upgrade to 13.4.6 and see if it alleviates the issues noted here!

daiyam commented 1 year ago

I'm using pages dir and I still have the issue with 13.4.6.

ijjk commented 1 year ago

@daiyam what is the issue you're seeing persist in v13.4.6?

daiyam commented 1 year ago

@ijjk Yes, assuming /next/dist/compiled/jest-worker/processChild.js processes are only created by next build. (I've only tested on a new website in prod, so next start always runs right after the build. I only found out about the issue yesterday, after the Docker container was using 3GB of memory...)

masterkain commented 1 year ago

this thread seems to be a mixed bag of people having issues during build time and/or runtime -- I can add my experience with the runtime https://github.com/vercel/next.js/issues/49623 (tl;dr add RAM)

masterkain commented 1 year ago

We had 1GB in our pods. We upped it to 2GB. This prevented the freak-out the Friday before last where our pods just kept rebooting and scaling to 10. They'd always peak on spin-up, as the cache was cold. We are looking into sharing the cache between pods to help with that.

I did exactly that in our Helm chart (https://github.com/icoretech/helm/blob/main/charts/airbroke/values.yaml#L22) but it did not change a thing, and I'm also not sure this should be done.

cjcheshire commented 1 year ago

@masterkain there's a thread here on this: https://github.com/vercel/next.js/discussions/23017#discussioncomment-5230940. It looks like there's a flag, isrMemoryCacheSize, that you'll need to set to zero for it to work.

Slightly worried about it possibly causing a race condition though - https://nextjs.org/docs/pages/building-your-application/data-fetching/incremental-static-regeneration#self-hosting-isr
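
For reference, disabling the in-memory ISR cache looks roughly like this. This is a sketch based on the linked discussion; check the flag against your Next.js version, since experimental options change:

// next.config.js -- sketch only
module.exports = {
  experimental: {
    // 0 disables the default in-memory ISR cache, so revalidated pages are
    // read from the filesystem cache instead (relevant when several pods
    // should share a cache).
    isrMemoryCacheSize: 0,
  },
};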

masterkain commented 1 year ago

It looks like there's a flag, isrMemoryCacheSize, that you'll need to set to zero for it to work.

Slightly worried about it possibly causing a race condition though - https://nextjs.org/docs/pages/building-your-application/data-fetching/incremental-static-regeneration#self-hosting-isr

Interesting, thanks for that. I did think about the race condition but was missing isrMemoryCacheSize, so time to experiment again. I've deviated a bit from the original topic though, so back to you, people 👍

lazarv commented 1 year ago

Next.js 13 with the app router (which is on by default since 13.4) always uses workers to run the app (see https://github.com/vercel/next.js/blob/canary/packages/next/src/server/lib/start-server.ts#L182), while the main app (main thread) acts as a proxy for the workers. I created an issue about this at https://github.com/vercel/next.js/issues/50586

At runtime this is not related to any build process, as far as I can see. Next.js is simply using jest-worker to start child processes. My assumption is that this was done to speed up the new RSC rendering, which is not yet optimal; see https://github.com/vercel/next.js/blob/canary/packages/next/src/server/app-render/use-flight-response.tsx
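
For anyone unfamiliar with jest-worker, the pattern is roughly the following. This is an illustrative sketch only, not Next.js's actual code; the ./render module and its render method are made up:

const { Worker } = require('jest-worker');

async function main() {
  // Fork child processes; each child runs jest-worker's processChild.js
  // bootstrap, which is why that file shows up in the process list.
  const worker = new Worker(require.resolve('./render'), {
    numWorkers: 2,               // number of child processes to fork
    exposedMethods: ['render'],  // methods proxied to the children
    enableWorkerThreads: false,  // false => child processes, not worker threads
  });

  const html = await worker.render('/some/page');
  console.log(html);

  // If end() is never reached (e.g. the parent crashes), the forked
  // processChild.js processes can be left running.
  await worker.end();
}

main();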

broksonic21 commented 1 year ago

Like @lazarv, we likewise saw much better behavior when disabling appDir. Not definitive, so YMMV, but I put notes in https://github.com/vercel/next.js/issues/49929#issuecomment-1602592624 and https://github.com/vercel/next.js/issues/51560#issuecomment-1599458889 on how the extra processes impact memory and the crashes/timeouts we saw in production on 13.4.4+.

madjam002 commented 1 year ago

All of my servers have recently been logging TCP: out of memory -- consider tuning tcp_mem in the kernel logs, causing Nginx connection-reset issues when talking to the Next.js upstream. After digging into it, there are a bunch of connections appearing from the Next.js process like so:

$ ip netns exec xxx ss -aemnpt

...
CLOSE-WAIT               142051                 0                                                127.0.0.1:48390                                          127.0.0.1:33853                 users:(("node",pid=14777,fd=85)) ino:88488113 sk:5a -->
     skmem:(r208866,rb2358342,t0,tb2626560,f30,w0,o0,bl0,d0)                                                                       
CLOSE-WAIT               174810                 0                                                127.0.0.1:45838                                          127.0.0.1:33853                 users:(("node",pid=14777,fd=50)) ino:88446849 sk:5c -->
     skmem:(r183190,rb4978722,t0,tb2626560,f1130,w0,o0,bl0,d1)                                                                     
....

They all go to port 33853, which turns out to be served by "jest-worker/processChild.js", which led me to this GitHub issue:

$ ip netns exec xxx netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
...    
tcp6       0      0 :::33853                :::*                    LISTEN      14927/node 

$ ps aux | grep 14927
root     14927  0.5  0.4 32811872 135924 ?     Sl   14:13   0:06 /nix/store/m00hsyaqpin3awwjyx0v7lxwzix73ibd-next-13.4.6-c2942e4eb7.zip/node_modules/next/dist/compiled/jest-worker/processChild.js

After a few hours there are thousands of connections in the CLOSE-WAIT state per Next.js process. I'm seeing this on Next.js 13.4.7 as well. Downgrading to 13.2.4 fixes the issue.

hnsr commented 1 year ago

Confirming this issue (at runtime). We deploy in standalone mode and run our app with PM2 instances, but PM2 no longer reports accurate memory usage (making the max memory restart feature broken), and leaves jest-worker processes running even after killing the parent PM2 instances, causing constant OOM situations.

Innei commented 1 year ago

Confirming this issue (at runtime). We deploy in standalone mode and run our app with PM2 instances, but PM2 no longer reports accurate memory usage (making the max memory restart feature broken), and leaves jest-worker processes running even after killing the parent PM2 instances, causing constant OOM situations.

Is there another solution? My app is also deployed in standalone mode and run with PM2 instances, but without appDir, next/image, or middleware.

Sometimes /dist/standalone/node_modules/.pnpm/next@13.4.5_@babel+core@7.21.8_react-dom@18.2.0_react@18.2.0/node_modules/next/dist/compiled/jest-worker/processChild.js leaks memory without triggering the system OOM killer, which leads to the system hanging.

Innei commented 1 year ago

According to the atop monitor, it leaks about 1GB of memory in roughly 5 seconds.

uncvrd commented 1 year ago

Confirming this issue (at runtime). We deploy in standalone mode and run our app with PM2 instances, but PM2 no longer reports accurate memory usage (making the max memory restart feature broken), and leaves jest-worker processes running even after killing the parent PM2 instances, causing constant OOM situations.

@hnsr were you able to confirm that disabling appDir reduced your memory footprint on your project? I see that you referenced a PR here and I just gave it a shot on mine. I made sure to match your NextJS version (13.4.5) and still had a large bump in memory consumption upon first request 🤔

hnsr commented 1 year ago

@uncvrd It seems that disabling appDir didn't work for us either to avoid the use of jest-worker when running the standalone server.js. I haven't had time to look into it further, but will give this another look and let you know if I find a workaround.

hnsr commented 1 year ago

Confirming this issue (at runtime). We deploy in standalone mode and run our app with PM2 instances, but PM2 no longer reports accurate memory usage (making the max memory restart feature broken), and leaves jest-worker processes running even after killing the parent PM2 instances, causing constant OOM situations.

Is there another solution? My app is also deployed in standalone mode and run with PM2 instances, but without appDir, next/image, or middleware.

Sometimes /dist/standalone/node_modules/.pnpm/next@13.4.5_@babel+core@7.21.8_react-dom@18.2.0_react@18.2.0/node_modules/next/dist/compiled/jest-worker/processChild.js leaks memory without triggering the system OOM killer, which leads to the system hanging.

Our hosting provider also noted this: the whole server we run on went down, instead of the OOM killer being invoked to keep things under control. I wonder why that is 🤔

hnsr commented 1 year ago

@uncvrd So it seems that the server.js generated in standalone mode is simply written to always use workers:

As you can see, the new version uses createServerHandler from https://github.com/vercel/next.js/blob/canary/packages/next/src/server/lib/render-server-standalone.ts, which always uses workers.

I am probably going to see if we can simply stick to an older version of Next.js for now.

Innei commented 1 year ago

Yes, the OOM killer is not working. This is my server monitor; IO, CPU, and memory all spike suddenly.

uncvrd commented 1 year ago

@hnsr thanks for confirming on your end. That's really odd, I had some luck reverting to 13.3.2 today so I'll stick with that for now

hnsr commented 1 year ago

The worker process seems to have been introduced in this commit: Fix standalone mode with appDir running in a single process

This was released in 13.4.0; the last release without the workers is 13.3.4.

Why we are downgrading:

  1. We had an issue where a process crashed, but the worker wasn't cleaned up. PM2 restarted nextjs several times a second, causing it to eat up 60GB of memory in a few seconds, crashing the server.

  2. We're currently using pm2 to run the application, but pm2 is unable to report the used memory (which we are trying to use to automatically restart when it is running out).

    Creating a worker process for the standalone mode seems somewhat odd in my opinion, according to the docs:

    Additionally, a minimal server.js file is also output which can be used instead of next start.

    Using separate worker processes isn't as minimal as it could be. Is there any way to get it 'flat' again @shuding?

billnbell commented 1 year ago

Or maybe we can add a parameter to turn off workers in Standalone mode?

What would be the impact of turning it off? And why was it added?

ijjk commented 1 year ago

The separate processes are needed to ensure app and pages routes are rendered separately, as they require different versions of React. The workers also already monitor memory usage and restart when running out, so PM2 shouldn't be needed here to achieve that.

What was the crash where the workers weren't cleaned up? That sounds like the real issue we should be addressing here.

broksonic21 commented 1 year ago

@ijjk not sure if this is the crash you mean, but I can create a crash even on the stock template app with workers here: https://github.com/vercel/next.js/issues/51560

It doesn't repro without workers (appDir: false).

S-YOU commented 1 year ago

app and pages routes are rendered separately as they require different versions of react.

Didn't realize two versions of React are needed. Hopefully it will be just one in the future.

billnbell commented 1 year ago

OK, can we set the memory usage on the box? 90% might be too much or too little.

const v8 = require('v8')
// 90% of the V8 heap size limit, in MB
const MAXIMUM_HEAP_SIZE_ALLOWED =
  (v8.getHeapStatistics().heap_size_limit / 1024 / 1024) * 0.9

Question: or, for those of us only using pages, can we turn off the workers mode?

billnbell commented 1 year ago

Also, if our process has Node memory set with --max-old-space-size=8192, will v8.getHeapStatistics().heap_size_limit return the right value?

Yes.

app.js

const maxHeapSz = require('v8').getHeapStatistics().heap_size_limit;
const maxHeapSz_GB = (maxHeapSz / 1024 ** 3).toFixed(1);
console.log(`${maxHeapSz_GB}GB`);

$ node --max-old-space-size=2048 app.js
2.0GB

hnsr commented 1 year ago

The separate processes are needed to ensure app and pages routes are rendered separately, as they require different versions of React. The workers also already monitor memory usage and restart when running out, so PM2 shouldn't be needed here to achieve that.

What was the crash where the workers weren't cleaned up? That sounds like the real issue we should be addressing here.

@ijjk Yes, it makes sense that this is the thing that needs to be investigated and hopefully fixed. The crashes in our case were mainly startup errors, e.g. EADDRINUSE because the standalone server.js failed to bind on port 3000 at one point. Another earlier cause was a SyntaxError (due to running the wrong Node version). This can happen during next dev as well: if I make a typo, hot reloading fails and it crashes, leaving a jest-worker running. Since I am a bad programmer, this can lead to my MacBook going OOM 😅

For our production setup I would still like to have some control over memory usage though. The way PM2 allowed us to do this through --max-memory-restart was ideal for us; is there any documentation on how we can accomplish this with the workers that Next.js now uses?
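
(For context, the PM2 behavior referred to above is configured roughly like this; the app name, script path, and limit are illustrative values, not a recommendation:)

// ecosystem.config.js -- illustrative values only
module.exports = {
  apps: [
    {
      name: 'next-app',
      script: './server.js',      // standalone output
      max_memory_restart: '1G',   // PM2 restarts the process above this memory usage
    },
  ],
};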

nicosh commented 1 year ago

@hnsr not sure if this can help, but this is how we make PM2 kill the processes when respawning:

kill_processChild.sh

#!/bin/bash

# Find the process IDs of all processes containing the string "processChild.js" in the command path
pids=$(pgrep -f "processChild.js")

# Iterate over each process ID and kill the corresponding process
for pid in $pids; do
    echo "Killing process: $pid"
    kill "$pid"
done

PM2 ecosystem.config.js

module.exports = {
  apps: [
    {
      name: 'main',
      script: 'npm',
      args: 'run app:start:force',
    },
  ],
};

and in package.json I have this script:

"app:start:force": "./kill_processChild.sh && cd apps/cms && npm run start",

This way PM2 will kill orphaned child processes before respawning the application.

cjcheshire commented 1 year ago

Has anyone tried 13.4.8 yet?

S-YOU commented 1 year ago

No difference on performance so far for 13.4.7 -> 13.4.8 in standalone production mode for me.

billnbell commented 1 year ago

Is it better or still running out of memory ?


S-YOU commented 1 year ago

Is it better or still running out of memory ?

I believe the memory or CPU issues are caused by high traffic, and my sites haven't had enough traffic to reproduce the issue; I also tune the application periodically for general performance problems. I am in this thread only because I am the one who pointed out in the Discussion that 13.4.0 spawns two new jest-worker processes.

DenisBessa commented 1 year ago

I can confirm the problem still happens on 13.4.8.

The weird part is that I cannot reproduce it reliably. It looks random to me.

constmoon commented 1 year ago

// next.config.js
// next 13.4.7
module.exports = {
  experimental: {
    appDir: false,
  },
}

Fortunately, since we weren't yet using the app router, I resolved the issue of excessive processChild.js processes by adding this option to next.config.js (https://github.com/vercel/next.js/issues/49929#issuecomment-1602592624). I hope the issue is resolved in a future version so that excessive processes aren't spawned even without such experimental options.

billnbell commented 1 year ago

@constmoon Does this fix the issue on the latest Next versions? Or were you just adding that we should turn it off?

constmoon commented 1 year ago

@billnbell I am using Next.js 13.4.7, and the issue was resolved when I added that configuration on that version. I'm not sure if it applies to the latest version though.

avkarenow commented 1 year ago

I had this problem with a free account on a serv00.com server; reducing the number of processes helped next build run.

My next.config.js:

/** @type {import('next').NextConfig} */

const nextConfig = {
  experimental: {
    cpus: 1
  }
}
module.exports = nextConfig

My versions: next@13.4.9, Node v16.20.0

AdamZajler commented 1 year ago

On 13.4.9 the problem still exists (the server is way more laggy; I have to go back to 13.2.4).

pawelmidur commented 1 year ago

I have a question. Do you start the next server in pm2 with the "exec_mode": "cluster" configuration? Or as a single process?

billnbell commented 1 year ago

I did it as single.

iamstuxn3t commented 1 year ago

I originally had the jest-worker issue and downgraded to 13.2.3 as suggested above; the jest-worker processes are gone. However, I am now getting a different CPU spike from ./.bin/next start, as described here: https://github.com/vercel/next.js/discussions/49203

timneutkens commented 1 year ago

Since these issues are being confused: this particular issue is about processes being retained after the build exits. It does not refer to running in production and processes being spawned in that case; for production memory usage, refer to this issue: #49929. On that issue I wrote down exactly what the 4 processes are: https://github.com/vercel/next.js/issues/49929#issuecomment-1637185156. Killing the processes randomly in production will cause your application to go down.

w7br commented 1 year ago

The problem also exists in version 13.2.4... I will revert back to version 13.1.6, as that version was still stable regarding memory leaks, although it was not lighter than 13.1.5.

It's frustrating not to have the confidence to start a project, as every version that fixes one bug brings dozens of others. I wish there were a truly stable version.

sedlukha commented 1 year ago

I still see the problem with 13.4.10: there are still too many jest-worker processes.

next v13.2.4: [screenshot]

next v13.4.10: [screenshot]

Edit: after some experiments, as far as I can see, experimental.appDir: false can solve the issue, but it causes another bug: https://github.com/vercel/next.js/issues/52875

S-YOU commented 1 year ago

FYI, the latest version (13.4.12) no longer spawns jest-worker processes, but instead dedicated processes for each renderer (not sure whether they are just renamed or not).

billnbell commented 1 year ago

OK, let's try it.

hanoii commented 1 year ago

The process rename happened in #52779; I wonder if the new releases fix the high number of these processes.

fab1an commented 1 year ago

Is anyone else concerned that these workers listen on all TCP interfaces instead of just localhost, meaning that they are exposed to the internet on a standard VM?

My Next.js process is started using /usr/bin/npm run start -- --port=XYZ --hostname=127.0.0.1, which works for the central service, but the workers just ignore this.

Is this a security risk?