vercel / next.js

The React Framework
https://nextjs.org
MIT License

High number of processes of /next/dist/compiled/jest-worker/processChild.js still alive after next build #45508

Closed zqjimlove closed 1 year ago

zqjimlove commented 1 year ago

Verify canary release

Provide environment information

Operating System:
  Platform: darwin
  Arch: arm64
  Version: Darwin Kernel Version 22.3.0: Thu Jan  5 20:48:54 PST 2023; root:xnu-8792.81.2~2/RELEASE_ARM64_T6000
Binaries:
  Node: 18.13.0
  npm: 8.19.3
  Yarn: 1.22.19
  pnpm: 7.26.2
Relevant packages:
  next: 12.0.9
  react: 17.0.2
  react-dom: 17.0.2

Which area(s) of Next.js are affected? (leave empty if unsure)

CLI (create-next-app)

Link to the code that reproduces this issue

https://github.com/vercel/next.js/files/10565355/reproduce.zip

To Reproduce

reproduce.zip

[Screenshot attached in the original issue]

This problem can be reproduced on next@12.0.9 and above; 12.0.8 was fine.

Alternatively, removing getInitialProps from _app.tsx also resolves it on next@12.0.9 and above:

// GlobalApp.getInitialProps = async function getInitialProps(appContext) {
//   const appProps = await App.getInitialProps(appContext);

//   return {
//     ...appProps,
//   };
// };
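
For context, a minimal _app.tsx along these lines matches the setup described above once the custom getInitialProps is present (a sketch reconstructed from the commented-out snippet; everything outside the getInitialProps assignment is assumed boilerplate):

    // pages/_app.tsx
    import App, { AppContext, AppProps } from 'next/app';

    function GlobalApp({ Component, pageProps }: AppProps) {
      return <Component {...pageProps} />;
    }

    // A custom App with getInitialProps opts every page out of Automatic
    // Static Optimization, which is what the reporter removed to avoid the
    // lingering worker processes.
    GlobalApp.getInitialProps = async function getInitialProps(appContext: AppContext) {
      const appProps = await App.getInitialProps(appContext);
      return { ...appProps };
    };

    export default GlobalApp;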

Describe the Bug

High number of processes of /next/dist/compiled/jest-worker/processChild.js still alive after next build

Expected Behavior

Kill all child processes.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

No response

NEXT-1348

Francoois commented 1 year ago

I think I have the same issue. I am using the headless CMS Keystone in a virtual environment that allows a maximum of 250 parallel processes. I'm getting a lot of keystone-app/node_modules/next/dist/compiled/jest-worker/processChild.js processes in parallel, which blocks the build of the app. I'm trying to find a workaround.

yzubkov commented 1 year ago

In my case I get several of these ".../node_modules/next/dist/compiled/jest-worker/processChild.js" processes taking up lots of memory. I see them appear after executing "npm run start", and they disappear when I terminate the app (Ctrl+C). Not sure if or how this relates to the build process.

schorfES commented 1 year ago

We have also observed this issue in production, where it consumes memory that is likely not used or needed. This behavior was introduced in version 13.4.0. There is an open discussion about this topic, which you can find at: https://github.com/vercel/next.js/discussions/49238.

nicosh commented 1 year ago

We have the same problem: after a few deployments the server runs out of memory. As a temporary fix I added the following script to the deployment pipeline:

#!/bin/bash

# Find the process IDs of all processes containing the string "processChild.js" in the command path
pids=$(pgrep -f "processChild.js")

# Iterate over each process ID and kill the corresponding process
for pid in $pids; do
    echo "Killing process: $pid"
    kill "$pid"
done

But even with this script, the application seems to keep spawning zombie processes.

switz commented 1 year ago

Seeing this as well in prod

MonstraG commented 1 year ago

Downgrading to <13.4.0 for now I guess

leerob commented 1 year ago

Merged this discussion into here: https://github.com/vercel/next.js/discussions/49238

This might be related: https://github.com/vercel/next.js/commit/83b774eeb69f1fe4f636260f83ed98c6d0717a3d#diff-90d1d5f446bdf243be25cc4ea2295a9c91508859d655e51d5ec4a3562d3a24d9L1930

leerob commented 1 year ago

Small favor, could you include a reproduction as a CodeSandbox instead of a zip file?

github-actions[bot] commented 1 year ago

We cannot recreate the issue with the provided information. Please add a reproduction in order for us to be able to investigate.

Why was this issue marked with the please add a complete reproduction label?

To be able to investigate, we need access to a reproduction to identify what triggered the issue. We prefer a link to a public GitHub repository (template for pages, template for App Router), but you can also use these templates: CodeSandbox: pages or CodeSandbox: App Router.

To make sure the issue is resolved as quickly as possible, please make sure that the reproduction is as minimal as possible. This means that you should remove unnecessary code, files, and dependencies that do not contribute to the issue.

Please test your reproduction against the latest version of Next.js (next@canary) to make sure your issue has not already been fixed.

I added a link, why was it still marked?

Ensure the link is pointing to a codebase that is accessible (e.g. not a private repository). "example.com", "n/a", "will add later", etc. are not acceptable links -- we need to see a public codebase. See the above section for accepted links.

What happens if I don't provide a sufficient minimal reproduction?

Issues with the please add a complete reproduction label that receive no meaningful activity (e.g. new comments with a reproduction link) are automatically closed and locked after 30 days.

If your issue has not been resolved in that time and it has been closed/locked, please open a new issue with the required reproduction.

I did not open this issue, but it is relevant to me, what can I do to help?

Anyone experiencing the same issue is welcome to provide a minimal reproduction following the above steps. Furthermore, you can upvote the issue using the :+1: reaction on the topmost comment (please do not comment "I have the same issue" without reproduction steps). Then, we can sort issues by votes to prioritize.

I think my reproduction is good enough, why aren't you looking into it quicker?

We look into every Next.js issue and constantly monitor open issues for new comments.

However, sometimes we might miss one or two due to the popularity/high traffic of the repository. We apologize, and kindly ask you to refrain from tagging core maintainers, as that will usually not result in increased priority.

Upvoting issues to show your interest will help us prioritize and address them as quickly as possible. That said, every issue is important to us, and if an issue gets closed by accident, we encourage you to open a new one linking to the old issue and we will look into it.


bfife-bsci commented 1 year ago

I am commenting as a +1 to #49238, which I think more accurately describes our issue. We only have 2 processChild.js processes, but this is likely because we run on GKE nodes with 2 CPUs. We run a minimum of 3 pods behind a service/load balancer. Unfortunately, we do not have a reproduction.

We were running 13.4.1 on Node v16.19.0 in our production environment and discovered that after some volume of requests, or perhaps simply a period of time (as short as a day and a half, as long as 5 days), some Next.js servers became slow or even unresponsive. New requests took at least 5 seconds to get a response. CPU usage in the pod was maxed out, split roughly 33% user and 66% system. We discovered that requests are proxied to a processChild.js child process, which listens on a different port (is this the new App Router?). We observed the following characteristics:

strace'ing showed the following signature over and over again with different sockets/URLs

...
write(1593, "GET /URL1"..., 2828) = -1 EAGAIN (Resource temporarily unavailable)
write(1600, "GET /URL2"..., 2833) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLOUT, {u32=266, u64=266}}, {EPOLLOUT, {u32=276, u64=276}}, {EPOLLOUT, {u32=280, u64=280}}, {EPOLLOUT, {u32=267, u64=267}}, {EPOLLOUT, {u32=315, u64=315}}, {EPOLLOUT, {u32=20, u64=20}}, {EPOLLOUT, {u32=322, u64=322}}, {EPOLLOUT, {u32=275, u64=275}}, {EPOLLOUT, {u32=325, u64=325}}, {EPOLLOUT, {u32=279, u64=279}}, {EPOLLOUT, {u32=332, u64=332}}, {EPOLLOUT, {u32=336, u64=336}}, {EPOLLOUT, {u32=314, u64=314}}, {EPOLLOUT, {u32=358, u64=358}}, {EPOLLOUT, {u32=324, u64=324}}, {EPOLLOUT, {u32=360, u64=360}}, {EPOLLOUT, {u32=281, u64=281}}, {EPOLLOUT, {u32=335, u64=335}}, {EPOLLOUT, {u32=296, u64=296}}, {EPOLLOUT, {u32=343, u64=343}}, {EPOLLOUT, {u32=377, u64=377}}, {EPOLLOUT, {u32=379, u64=379}}, {EPOLLOUT, {u32=359, u64=359}}, {EPOLLOUT, {u32=285, u64=285}}, {EPOLLOUT, {u32=268, u64=268}}, {EPOLLOUT, {u32=392, u64=392}}, {EPOLLOUT, {u32=366, u64=366}}, {EPOLLOUT, {u32=378, u64=378}}, {EPOLLOUT, {u32=406, u64=406}}, {EPOLLOUT, {u32=326, u64=326}}, {EPOLLOUT, {u32=323, u64=323}}, {EPOLLOUT, {u32=420, u64=420}}, ...], 1024, 0) = 275
write(266, "GET /URL3"..., 2839) = -1 EAGAIN (Resource temporarily unavailable)
write(276, "GET /URL4"..., 2830) = -1 EAGAIN (Resource temporarily unavailable)
write(280, "GET /URL5"..., 2825) = -1 EAGAIN (Resource temporarily unavailable)
...

It looks like the parent process continuously retries sending requests that are not being serviced/read by the child process. We're not sure what puts the server into this state (new requests are still accepted and responded to, slowly), but due to the unresponsiveness we downgraded back to 13.2.3.
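
(For readers less familiar with the failure mode: EAGAIN on a non-blocking write means the kernel buffer toward the child is full because the child is not reading. In Node terms this is ordinary stream backpressure; the following is a minimal sketch of the same signal at the application level, not Next.js code.)

    // backpressure.ts - illustrative only; socket.write() returning false is the
    // Node-level counterpart of the EAGAIN writes in the strace output above.
    import net from 'node:net';

    function send(socket: net.Socket, payload: string) {
      // write() returns false once the internal/kernel buffers are full,
      // i.e. the peer (here, the child process) is not reading fast enough.
      const flushed = socket.write(payload);
      if (!flushed) {
        socket.once('drain', () => {
          console.log('peer caught up, buffer drained');
        });
      }
    }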

billnbell commented 1 year ago

I get next/dist/compiled/jest-worker/processChild.js processes running with NODE_ENV=production when running next start??

Downgrading.

csi-lk commented 1 year ago

Downgrading

Hmm, I don't know if downgrading helps, @billnbell. I've seen this in our traces going back a few versions now; let me know if you have a specific version where this isn't an issue. I'm worried about memory utilisation, as we're seeing it max out on our containers :)

Edit: just read above about < 13.4.0, I'll give this a go and report back.

cjcheshire commented 1 year ago

Here to say me too. We’ve recently jumped on the 13.4 bandwagon and in the last two weeks started to see memory maxing out.

(Apologies, I just read the bot asking me not to say this.)

BuddhiAbeyratne commented 1 year ago

I just had a massive outage thanks to this. It creeps up on you and doesn't die, there's no way to easily kill the workers, and it also stops build systems once it hits max RAM.

billnbell commented 1 year ago

I can confirm downgrading worked for me. 13.2.3

BuddhiAbeyratne commented 1 year ago

maybe this will help too

    git pull
    npm ci || exit
    # Build into a temporary directory (assumes next.config.js maps BUILD_DIR to distDir)
    BUILD_DIR=.nexttemp npm run build || exit
    if [ ! -d ".nexttemp" ]; then
        echo '\033[31m.nexttemp directory does not exist!\033[0m'
        exit 1
    fi
    rm -rf .next
    mv .nexttemp .next
    pm2 reload all --update-env
    echo "Deployment done."

BuddhiAbeyratne commented 1 year ago

Seems like the jest worker is required, or else pm2 can't serve the site in prod mode.

BuddhiAbeyratne commented 1 year ago

The solution I'm using now is to kill them all and restart the service, so it only spawns 2 workers.

cjcheshire commented 1 year ago

I just had a massive outage thanks to this. It creeps up on you and doesn't die, there's no way to easily kill the workers, and it also stops build systems once it hits max RAM.

This is freaky. We just did too!

We have 800 pages, and some of them need more than two API requests to build. We had a 1 GB limit on our pods; upping it to 2 GB has helped us.

BuddhiAbeyratne commented 1 year ago

I'm on 13.4.1 if that helps to debug

billnbell commented 1 year ago

I'm on 13.4.1 if that helps to debug

That is why I switched to 13.2.3. I have not tried newer versions or canary yet.

billnbell commented 1 year ago

Just to be clear: we get out of memory in PRODUCTION mode when serving the site. I know others are seeing it when using next build, but we are getting this over time when using next start. Downgrading worked for us.

I don't really know why jest is running and eating all the memory on the box. Can we add a parameter to turn off jest when running next start?

cjcheshire commented 1 year ago

@billnbell it's not Jest though, right? It's the jest-worker package.

We even prune dev dependencies in production!

billnbell commented 1 year ago

What is a jest-worker?

cjcheshire commented 1 year ago

It’s a package. Which we presume is how the background tasks work for building. https://www.npmjs.com/package/jest-worker?activeTab=readme

S-YOU commented 1 year ago

The name jest-worker is actually confusing (at least for me) because of the popular test framework Jest. Jest itself seems to be a huge repository with a lot of packages; this one should be called Facebook's web server / worker or something else.

S-YOU commented 1 year ago

there were over 3100 TCP connections established between the parent and processChild.js process

This kind of task performs best with Unix sockets, but it seems the Next.js team is not interested in Unix sockets (PR closed: https://github.com/vercel/next.js/pull/20192). That PR itself is about the front side, but with the current architecture, communication between parent and children over Unix sockets would be a nice improvement (when available, of course).
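
(As an illustration only, not Next.js code: a worker that accepts connections over a Unix domain socket instead of a localhost TCP port looks roughly like this; the socket path and response are hypothetical.)

    // worker.ts - hypothetical sketch of a child process serving requests
    // over a Unix domain socket rather than a TCP port.
    import net from 'node:net';
    import fs from 'node:fs';

    const SOCKET_PATH = '/tmp/next-worker.sock'; // hypothetical path

    // Remove a stale socket file left over from a previous run, if any.
    if (fs.existsSync(SOCKET_PATH)) fs.unlinkSync(SOCKET_PATH);

    const server = net.createServer((conn) => {
      conn.on('data', (chunk) => {
        // Echo a trivial response; a real worker would render a page here.
        conn.write(`handled ${chunk.length} bytes\n`);
      });
    });

    server.listen(SOCKET_PATH, () => {
      console.log(`worker listening on ${SOCKET_PATH}`);
    });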

BuddhiAbeyratne commented 1 year ago

@billnbell the app dir is not stable on that version, as far as I was aware. Can you let me know if you use it?

billnbell commented 1 year ago

It’s a package. Which we presume is how the background tasks work for building. https://www.npmjs.com/package/jest-worker?activeTab=readme

Why would it be running in next start mode? We are not building anymore.

billnbell commented 1 year ago

@billnbell app dir is not stable on that as far as I was aware. can you let me know if you use it ?

What do you mean? What is app dir?

cjcheshire commented 1 year ago

@billnbell There are now two routing approaches. The older of the two is pages; the newer one, which still seems to need some love, is the app folder.

We have the issue and use the older architectural approach: pages, not app.

Re next start: we also do this in our container, but we use ISR too, which triggers a rebuild of a page once it is old enough to be regenerated and re-cached.
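
(For anyone unfamiliar with ISR in the pages router: it is driven by a revalidate value returned from getStaticProps, roughly as in this sketch; the route, API URL, and data shape are illustrative only.)

    // pages/products/[id].tsx - illustrative ISR setup (pages router)
    import type { GetStaticProps, GetStaticPaths } from 'next';

    export const getStaticPaths: GetStaticPaths = async () => ({
      paths: [],            // generate pages on demand
      fallback: 'blocking',
    });

    export const getStaticProps: GetStaticProps = async ({ params }) => {
      const res = await fetch(`https://api.example.com/products/${params?.id}`); // hypothetical API
      const product = await res.json();
      return {
        props: { product },
        // Once a cached page is older than 60s, the next request triggers a
        // background re-render (the rebuild mentioned above).
        revalidate: 60,
      };
    };

    export default function Product({ product }: { product: { name: string } }) {
      return <h1>{product.name}</h1>;
    }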

billnbell commented 1 year ago

We are using pages

bfife-bsci commented 1 year ago

It’s a package. Which we presume is how the background tasks work for building. https://www.npmjs.com/package/jest-worker?activeTab=readme

Why would it be running in next start mode? We are not building anymore.

This is what we see as well; there are jest-worker processChild.js processes running at run time in production (I am not talking about build time). They seem to be used for rendering requested pages.

HamAndRock commented 1 year ago

We are also having memory issues on the app router, with some apps just running out of memory after a few hours and forcing restarts.

Can't really give much info here :/ but we have been seeing similar issues for a few versions now.

nicosh commented 1 year ago

In my case, I had a component that made some socket connections, which caused the application to crash. PM2 respawns the application in cluster mode without killing those processes, so I ended up with three additional ghost processes every time the application crashed.

S-YOU commented 1 year ago

jest-worker seems to have an option for using threads instead of processes:

https://github.com/jestjs/jest/blob/main/packages/jest-worker/README.md#enableworkerthreads-boolean-optional

enableWorkerThreads: boolean (optional) By default, jest-worker will use child_process threads to spawn new Node.js processes. If you prefer worker_threads (https://nodejs.org/api/worker_threads.html) instead, pass enableWorkerThreads: true.

Maybe having an option to expose workerThreads in Next.js could mitigate some use cases? (I tried similar options in the current Next.js experimental config, but it didn't work; child processes are still spawned.)
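
(For reference, this is presumably the kind of experimental config being referred to; a sketch only, and whether these flags affect the runtime render workers, as opposed to the build workers, is exactly what is in question in this thread.)

    // next.config.js - sketch of the experimental knobs mentioned above
    module.exports = {
      experimental: {
        // Ask Next.js to use worker_threads for its workers instead of
        // child processes; per the comment above, child processes were
        // still spawned in practice.
        workerThreads: true,
        // Limit how many workers are created.
        cpus: 1,
      },
    };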

t0mtaylor commented 1 year ago

I can confirm downgrading worked for me. 13.2.3

@billnbell I'm running 13.3.4, which is below 13.4.x, and things seem OK with the jest-worker.

Do you want to give that a try and confirm? Otherwise I'll try dropping down further to 13.2.3.

DenisBessa commented 1 year ago

This error caused a 7-node Kubernetes deployment to go completely unresponsive at random. We almost went nuts here.

christopherbowers commented 1 year ago

I can confirm downgrading worked for me. 13.2.3

@billnbell I'm running 13.3.4, which is below 13.4.x, and things seem OK with the jest-worker.

Do you want to give that a try and confirm? Otherwise I'll try dropping down further to 13.2.3.

After our EC2 m6i.2xlarge went unresponsive on 13.4.2, we downgraded to 13.3.4 to fix the jest-worker and processChild.js processes.

ericmgoeken commented 1 year ago

Our site uses the app folder routes. I'm not sure downgrading to 13.3.4 is an option.

BuddhiAbeyratneAL commented 1 year ago

It is not, @ericmgoeken. The solution we used was to cull that process using the snippet from @nicosh and do a restart from a cron tab at random intervals. ^ This is not a fix, and I frankly hate that we have to resort to it.

Since we have zero-downtime deployments with autoscaling, it kind of handles this abuse.

billnbell commented 1 year ago

Is the consensus that 13.3.4 works?

ericmgoeken commented 1 year ago

Yeah, I even got the app route working on 13.3.4

t0mtaylor commented 1 year ago

Yeah, I even got the app route working on 13.3.4

@ericmgoeken can you post how, along with any code snippets that may help @BuddhiAbeyratne / @BuddhiAbeyratneAL and anyone else get app routing running on 13.3.4?

ericmgoeken commented 1 year ago

@t0mtaylor, I just changed the package.json to 13.3.4 and added the experimental config:

  // next.config.js
  module.exports = {
    experimental: {
      appDir: true,
    },
  };

Also, to add some context to the issue: for my project it would go out of control from npm start. If I rebooted the server, it would go out of control again. So you need to run npm run build before npm start, but it's not the build command that caused the issue for me.

faridvatani commented 1 year ago

What is the minimum amount of RAM required for the processChild.js process? Has anyone had experience adding more RAM?

cjcheshire commented 1 year ago

What is the minimum amount of RAM required for the processChild.js process? Has anyone had experience adding more RAM?

We had 1 GB in our pods and upped it to 2 GB. This prevented the freak-out from the Friday before last, where our pods just kept rebooting and scaling to 10. They’d always peak on spin-up, as the cache was cold. We are looking into sharing the cache between pods to help with that.

We dropped to 13.2.3 the following week, as our pods never scaled back down. Since the revert they have been happy (but we all know this).

redaxle commented 1 year ago

@billnbell it seems like 13.2.4 is the most recent version that doesn't create these jest-worker processes.

betoquiroga commented 1 year ago

I have a project deployed on AWS, and we had upgraded to 13.4 to use next/font in production. But after the deploy, the jest-worker processes killed our website with only 300 concurrent users. AWS autoscaling went on to create up to 10 new instances to balance the load within minutes! Crazy!

We solved it, for now, by going down to 13.2.4.

13.3 didn't resolve the problem. We don't use the App directory yet.

cannontrodder commented 1 year ago

Seems relevant: https://github.com/vercel/next.js/pull/51271 - it seems that if appDir was enabled, it used half the number of workers, and this impacted build performance? The fix looks like it will land in 13.4.6.