zqjimlove closed this issue 1 year ago
I think I have the same issue.
I am using the headless CMS Keystone in a virtual environment that allows a maximum of 250 parallel processes.
I'm getting a lot of keystone-app/node_modules/next/dist/compiled/jest-worker/processChild.js processes in parallel, which block the build of the app.
I'm trying to find a workaround.
In my case I get several of these ".../node_modules/next/dist/compiled/jest-worker/processChild.js" processes taking up lots of memory. I see them appear after executing "npm run start", and they disappear when I terminate the app (Ctrl+C). Not sure if or how this relates to the build process.
We have also observed this issue in production, where it consumes memory that is likely not used or needed. This behavior was introduced in version 13.4.0. There is an open discussion about this topic: https://github.com/vercel/next.js/discussions/49238.
We have the same problem: after a few deployments the server runs out of memory. As a temporary fix I added the following script to the deployment pipeline:
#!/bin/bash
# Find the process IDs of all processes containing the string "processChild.js" in the command path
pids=$(pgrep -f "processChild.js")
# Iterate over each process ID and kill the corresponding process
for pid in $pids; do
echo "Killing process: $pid"
kill "$pid"
done
But even with this script, it seems that the application keeps spawning zombie processes.
Seeing this as well in prod
Downgrading to <13.4.0 for now, I guess.
Merged this discussion into here: https://github.com/vercel/next.js/discussions/49238
This might be related: https://github.com/vercel/next.js/commit/83b774eeb69f1fe4f636260f83ed98c6d0717a3d#diff-90d1d5f446bdf243be25cc4ea2295a9c91508859d655e51d5ec4a3562d3a24d9L1930
Small favor, could you include a reproduction as a CodeSandbox instead of a zip file?
We cannot recreate the issue with the provided information. Please add a reproduction in order for us to be able to investigate.
Why was this issue marked with the please add a complete reproduction label?
To be able to investigate, we need access to a reproduction to identify what triggered the issue. We prefer a link to a public GitHub repository (template for pages, template for App Router), but you can also use these templates: CodeSandbox: pages or CodeSandbox: App Router.
To make sure the issue is resolved as quickly as possible, please make sure that the reproduction is as minimal as possible. This means that you should remove unnecessary code, files, and dependencies that do not contribute to the issue.
Please test your reproduction against the latest version of Next.js (next@canary) to make sure your issue has not already been fixed.
Ensure the link is pointing to a codebase that is accessible (e.g. not a private repository). "example.com", "n/a", "will add later", etc. are not acceptable links -- we need to see a public codebase. See the above section for accepted links.
Issues with the please add a complete reproduction label that receive no meaningful activity (e.g. new comments with a reproduction link) are automatically closed and locked after 30 days.
If your issue has not been resolved in that time and it has been closed/locked, please open a new issue with the required reproduction.
Anyone experiencing the same issue is welcome to provide a minimal reproduction following the above steps. Furthermore, you can upvote the issue using the :+1: reaction on the topmost comment (please do not comment "I have the same issue" without reproduction steps). Then, we can sort issues by votes to prioritize.
We look into every Next.js issue and constantly monitor open issues for new comments.
However, sometimes we might miss one or two due to the popularity/high traffic of the repository. We apologize, and kindly ask you to refrain from tagging core maintainers, as that will usually not result in increased priority.
Upvoting issues to show your interest will help us prioritize and address them as quickly as possible. That said, every issue is important to us, and if an issue gets closed by accident, we encourage you to open a new one linking to the old issue and we will look into it.
I am commenting as a +1 to #49238 which I think more accurately described our issue. We only have 2 processChild.js processes, but this is likely due to running on GKE nodes with 2 CPUs. We run a minimum of 3 pods behind a service/load balancer. We unfortunately do not have a reproduction.
We were running 13.4.1 on node v16.19.0 in our production environment, and discovered that after some volume of requests, or perhaps simply some period of time (as short as a day and a half, as long as 5 days), some Next.js servers were becoming slow or even unresponsive. New requests would take at least 5 seconds to produce a response. CPU usage in the pod was maxed out, divided roughly at 33% user and 66% system. We discovered that requests are being proxied to a processChild.js child process, which is listening on a different port (is this the new App Router?). We observed the following characteristics:
- there were over 3100 TCP connections established between the parent and processChild.js process
- strace'ing showed the following signature over and over again with different sockets/URLs:
...
write(1593, "GET /URL1"..., 2828) = -1 EAGAIN (Resource temporarily unavailable)
write(1600, "GET /URL2"..., 2833) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLOUT, {u32=266, u64=266}}, {EPOLLOUT, {u32=276, u64=276}}, {EPOLLOUT, {u32=280, u64=280}}, {EPOLLOUT, {u32=267, u64=267}}, {EPOLLOUT, {u32=315, u64=315}}, {EPOLLOUT, {u32=20, u64=20}}, {EPOLLOUT, {u32=322, u64=322}}, {EPOLLOUT, {u32=275, u64=275}}, {EPOLLOUT, {u32=325, u64=325}}, {EPOLLOUT, {u32=279, u64=279}}, {EPOLLOUT, {u32=332, u64=332}}, {EPOLLOUT, {u32=336, u64=336}}, {EPOLLOUT, {u32=314, u64=314}}, {EPOLLOUT, {u32=358, u64=358}}, {EPOLLOUT, {u32=324, u64=324}}, {EPOLLOUT, {u32=360, u64=360}}, {EPOLLOUT, {u32=281, u64=281}}, {EPOLLOUT, {u32=335, u64=335}}, {EPOLLOUT, {u32=296, u64=296}}, {EPOLLOUT, {u32=343, u64=343}}, {EPOLLOUT, {u32=377, u64=377}}, {EPOLLOUT, {u32=379, u64=379}}, {EPOLLOUT, {u32=359, u64=359}}, {EPOLLOUT, {u32=285, u64=285}}, {EPOLLOUT, {u32=268, u64=268}}, {EPOLLOUT, {u32=392, u64=392}}, {EPOLLOUT, {u32=366, u64=366}}, {EPOLLOUT, {u32=378, u64=378}}, {EPOLLOUT, {u32=406, u64=406}}, {EPOLLOUT, {u32=326, u64=326}}, {EPOLLOUT, {u32=323, u64=323}}, {EPOLLOUT, {u32=420, u64=420}}, ...], 1024, 0) = 275
write(266, "GET /URL3"..., 2839) = -1 EAGAIN (Resource temporarily unavailable)
write(276, "GET /URL4"..., 2830) = -1 EAGAIN (Resource temporarily unavailable)
write(280, "GET /URL5"..., 2825) = -1 EAGAIN (Resource temporarily unavailable)
...
It looks like the parent process continuously retries sending requests which are not being serviced/read into the child process. We're not sure what puts the server into this state (new requests will still be accepted and responded to slowly), but due to the unresponsiveness we downgraded back to 13.2.3.
I get next/dist/compiled/jest-worker/processChild.js running in NODE_ENV=production when running next start??
Downgrading.
Downgrading
Hmm, I don't know if downgrading helps @billnbell. I've seen this in our traces going back a few versions now. Let me know if you have a specific version where this isn't an issue; I'm worried about memory utilisation as we're seeing it max out on our containers :)
Edit: just read above about < 13.4.0, I'll give this a go and report back
Here to say me too. We’ve recently jumped on the 13.4 bandwagon and in the last two weeks have started to see memory maxing out.
(Apologies, just read the bot asking me not to say this)
I just had a massive outage thanks to this. It creeps up on you and doesn't die, there's no way to easily kill the workers, and it also stops build systems once it hits max RAM.
I can confirm downgrading worked for me. 13.2.3
maybe this will help too
git pull
npm ci || exit
BUILD_DIR=.nexttemp npm run build || exit
if [ ! -d ".nexttemp" ]; then
  # The build went to .nexttemp; abort the swap if it is missing.
  echo -e '\033[31m.nexttemp directory does not exist!\033[0m'
  exit 1
fi
rm -rf .next
mv .nexttemp .next
pm2 reload all --update-env
echo "Deployment done."
Seems like the jest worker is required, or else pm2 can't serve the site in prod mode.
The solution I'm using now is to kill everything and restart the service so it only makes 2 workers.
> I just had a massive outage thanks to this. It creeps up on you and doesn't die, there's no way to easily kill the workers, and it also stops build systems once it hits max RAM.
This is freaky. We just did too!
We have 800 pages, and some of them need more than two API requests to build. We had a 1 GB limit on our pods; upping it to 2 GB has helped us.
I'm on 13.4.1 if that helps to debug.
> I'm on 13.4.1 if that helps to debug.
That is why I switched to 13.2.3. I have not tried newer versions or canary yet.
Just to be clear - we get out of memory in PRODUCTION mode when serving the site. I know others are seeing it when using next build, but we are getting this over time when using next start. Downgrading worked for us.
I don't really know why jest is running and eating all the memory on the box. Can we add a parameter to turn off jest when running next start?
@billnbell it's not jest though, right? It's the jest-worker package.
We even prune dev dependencies in production!
What is a jest-worker?
It’s a package. Which we presume is how the background tasks work for building. https://www.npmjs.com/package/jest-worker?activeTab=readme
The name jest-worker is actually confusing (at least for me) because of the popular test framework jest; jest itself seems to be a huge repository with a lot of packages. It should be called Facebook's web server / worker or something else.
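For anyone unfamiliar with the package, here is a minimal sketch of how jest-worker is typically used, based on its README; the ./render module and renderPage function are hypothetical and not taken from Next.js internals:

// render.js (hypothetical worker module): every exported function becomes an
// async method on the Worker instance and runs inside a pooled child process.
// module.exports.renderPage = async (url) => `<html>rendered ${url}</html>`;

// parent.js
const { Worker } = require('jest-worker');

async function main() {
  const worker = new Worker(require.resolve('./render'), {
    numWorkers: 2, // each worker is a processChild.js-style child process
  });

  const html = await worker.renderPage('/some-url'); // call is proxied to a child
  console.log(html);

  await worker.end(); // shuts the pool down
}

main().catch(console.error);

That would match what people are reporting above: the parent hands work off to pooled children, which stay alive for as long as the parent process does.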
> there were over 3100 TCP connections established between the parent and processChild.js process
This kind of task performs best with Unix sockets, and it seems like the Next.js team is not interested in Unix sockets (PR closed: https://github.com/vercel/next.js/pull/20192). That PR itself is about the front side, but with the current architecture, communication between parent and children over Unix sockets would be a nice improvement (when available, of course).
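To make the suggestion concrete, here is a minimal sketch of parent/child traffic over a Unix domain socket instead of a TCP port; this is my own illustration (the socket path is made up), not anything Next.js currently does:

const fs = require('fs');
const net = require('net');

const SOCKET_PATH = '/tmp/next-render-worker.sock'; // hypothetical path
if (fs.existsSync(SOCKET_PATH)) fs.unlinkSync(SOCKET_PATH); // clear a stale socket file

// "Child" side: listen on the Unix socket instead of binding a TCP port.
const server = net.createServer((conn) => {
  conn.on('data', (chunk) => {
    // A real worker would render a page here; this just echoes the request line.
    conn.end(`handled: ${chunk.toString().trim()}\n`);
  });
});
server.listen(SOCKET_PATH);

// "Parent" side: forward a request over the same socket.
const client = net.connect(SOCKET_PATH, () => {
  client.write('GET /some-page\n');
});
client.on('data', (data) => {
  process.stdout.write(data.toString());
  client.end();
  server.close();
});

The benefit over loopback TCP would mainly be avoiding per-connection port/socket pressure on the host, at the cost of only working on platforms that support Unix sockets.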
@billnbell the app dir is not stable on that version as far as I was aware. Can you let me know if you use it?
> It’s a package. Which we presume is how the background tasks work for building. https://www.npmjs.com/package/jest-worker?activeTab=readme
Why would it be running in next start mode? We are not building anymore.
> @billnbell the app dir is not stable on that version as far as I was aware. Can you let me know if you use it?
What do you mean? What is app dir?
@billnbell There are now two routing approaches. The older of the two is the pages folder; the newer, which seems to still need some love, is the app folder.
We have the issue and use the older architectural approach - pages, not app.
Re next start, we too do this in our container, but we also have ISR, which triggers a rebuild of a page (and re-caches it) once it is old enough.
We are using pages
> It’s a package. Which we presume is how the background tasks work for building. https://www.npmjs.com/package/jest-worker?activeTab=readme
> Why would it be running in next start mode? We are not building anymore.
This is what we see as well; there are jest-worker processChild.js processes running at run time in production (I am not talking about build time). They seem to be used for rendering requested pages.
We are also having memory issues on the app router, with some apps just running out of memory after a few hours and forcing restarts.
Can't really give much info here :/ But we have been having similar issues for a few versions now.
In my case, I had a component that made some socket connections, which caused the application to crash. PM2 will respawn the application in cluster mode without killing those processes, so I ended up with three additional ghost processes every time the application crashed.
jest-worker seems to have an option for using threads instead of processes:
> enableWorkerThreads: boolean (optional) By default, jest-worker will use child_process threads to spawn new Node.js processes. If you prefer worker_threads (https://nodejs.org/api/worker_threads.html) instead, pass enableWorkerThreads: true.
Maybe having an option to expose worker threads in Next.js could mitigate some use cases? (I tried similar options in the current Next.js experimental config, but it didn't work; child processes are still spawned.)
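For reference, this is roughly what that option looks like on the jest-worker side (reusing the hypothetical ./render module from the sketch above); whether and how Next.js could expose it is the open question:

const { Worker } = require('jest-worker');

const worker = new Worker(require.resolve('./render'), {
  numWorkers: 2,
  enableWorkerThreads: true, // use worker_threads in-process instead of spawning separate child processes
});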
> I can confirm downgrading worked for me. 13.2.3
@billnbell I'm running 13.3.4, which is below 13.4.x, and things seem OK with the jest-worker. Do you want to give that a try and confirm? Otherwise I'll try and drop down further to 13.2.3.
This error ended up making a 7-node Kubernetes deployment go completely unresponsive at random. Almost went nuts here.
> I can confirm downgrading worked for me. 13.2.3
> @billnbell I'm running 13.3.4, which is below 13.4.x, and things seem OK with the jest-worker. Do you want to give that a try and confirm? Otherwise I'll try and drop down further to 13.2.3.
After our EC2 m6i.2xlarge went unresponsive on 13.4.2, we downgraded to 13.3.4 to fix the jest-worker and processChild.js processes.
Our site uses the app folder routes. I'm not sure downgrading to 13.3.4 is an option.
It is not, @ericmgoeken. The solution we used was to cull that process using the snippet from @nicosh and do a restart from a cron tab at random intervals. ^ This is not a fix and I frankly hate that we have to resort to it.
As we have zero-downtime deployments with autoscaling, it kinda handles this abuse.
Is the consensus that 13.3.4 works?
Yeah, I even got the app route working on 13.3.4
> Yeah, I even got the app route working on 13.3.4
@ericmgoeken can you post here how, with any code snippets that may help @BuddhiAbeyratne @BuddhiAbeyratneAL and anyone else run app routing on 13.3.4?
@t0mtaylor, I just changed the package.json to 13.3.4 and added the experimental config in next.config.js:
module.exports = {
  experimental: {
    appDir: true,
  },
};
Also, to add context to the issue: for my project it would go out of control from npm start. If I rebooted the server it would go out of control again. So you need to do a build before npm start, but it's not the build command that caused the issue for me.
What is the minimum amount of RAM required for the processChild.js process? Has anyone had experience adding more RAM?
> What is the minimum amount of RAM required for the processChild.js process? Has anyone had experience adding more RAM?
We had 1 GB in our pods. We upped it to 2 GB. This prevented the freak-out the Friday before last, where our pods just kept rebooting and scaled to 10. They'd always peak on spin-up, as the cache was cold. We are looking into sharing the cache between pods to help with that.
We dropped to 13.2.3 the following week as our pods never scaled back down. Since the revert they have been happy (but we all know this).
@billnbell it seems like 13.2.4 is the most recent version that doesn't create these jest-worker processes.
I have a project deployed on AWS, and we had upgraded to 13.4 to use Next/fonts in production. But after the deploy the jest-worker processes killed our website with only 300 concurrent users. AWS autoscaling went on to create up to 10 new instances to balance the load within minutes! Crazy!
We solved it, for now, by going down to 13.2.4.
13.3 didn't resolve the problem. We don't use the App directory yet.
Seems relevant: https://github.com/vercel/next.js/pull/51271 - it seems that if appDir was enabled, it used half the number of workers, and this impacted build performance? That fix looks like it will be in 13.4.6.
Verify canary release
Provide environment information
Which area(s) of Next.js are affected? (leave empty if unsure)
CLI (create-next-app)
Link to the code that reproduces this issue
https://github.com/vercel/next.js/files/10565355/reproduce.zip
To Reproduce
reproduce.zip
This problem can be reproduced on next@12.0.9 and above, but 12.0.8 was all right.
Alternatively, removing getInitialProps in _app.tsx also makes it fine on next@12.0.9 and above.
Describe the Bug
A high number of /next/dist/compiled/jest-worker/processChild.js processes are still alive after next build.
Expected Behavior
Kill all child processes.
Which browser are you using? (if relevant)
No response
How are you deploying your application? (if relevant)
No response
NEXT-1348