vercel / next.js

The React Framework
https://nextjs.org
MIT License
122.59k stars 26.24k forks source link

Next >12.0.1 causes memory leak after upgrading from Next 11 #36703

Closed jflayhart closed 1 year ago

jflayhart commented 2 years ago

Verify canary release

Provide environment information

$ next info
(node:39290) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

What browser are you using? (if relevant)

Chrome 100

How are you deploying your application? (if relevant)

custom express server running on GCP distributed across multiple containers running 1GB memory

Describe the Bug

We're in a bit of a catch22 because we tried to upgrade to 12.0.0, which works, but then our Lighthouse scores started failing due to a security vulnerability in 12.0.0. I don't know if y'all ever fixed that but something was done in 12.0.5 it seems based on the change log, so we tried that, which created another issue: memory leaks.

After upgrading to 12.0.5 a new issue occurred where the prod optimized build app simply crashes due to OOM errors:

<--- Last few GCs --->
12:22:59.957 dev-dev-uc1-gke next-4072-webserver
12:22:59.958 dev-dev-uc1-gke next-4072-webserver [27:0x7f24d13882c0]   755827 ms: Mark-sweep (reduce) 770.6 (773.1) -> 770.3 (774.9) MB, 998.9 / 0.0 ms  (average mu = 0.100, current mu = 0.001) allocation failure scavenge might not succeed
12:22:59.958 dev-dev-uc1-gke next-4072-webserver [27:0x7f24d13882c0]   756837 ms: Mark-sweep (reduce) 771.3 (776.9) -> 771.1 (777.6) MB, 1009.1 / 0.0 ms  (average mu = 0.052, current mu = 0.001) allocation failure scavenge might not succeed
12:22:59.958 dev-dev-uc1-gke next-4072-webserver
12:22:59.958 dev-dev-uc1-gke next-4072-webserver
12:22:59.958 dev-dev-uc1-gke next-4072-webserver <--- JS stacktrace --->
12:22:59.958 dev-dev-uc1-gke next-4072-webserver
12:22:59.958 dev-dev-uc1-gke next-4072-webserver FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
12:23:00.016 dev-dev-uc1-gke next-4072-webserver error Command failed with signal "SIGABRT".

I tried 12.1 and even canary releases to see if anything worked itself out. Same problem. I tracked down a memory leak with growing Promises and Arrays within webpack runtime and cjs-loader and tslib (whatever that means) when building prod build locally and running node --inspect:

image

Screen Shot 2022-05-03 at 1 52 29 PM

In the end, I found 12.0.1 to be the only version after 12.0.0 with no leak; going to 12.0.2 starts the memory leak and crashes the app in the same way.

The prime suspects are the following deps we have installed based on their interaction with promises and arrays:

Already tried upgrading node to latest LTS (16), still doesn't work. Local dev runs "fine" albeit slow, but local builds don't crash. Only prod optimized builds

Expected Behavior

App is expected to NOT crash, so I upgraded my personal website to 12.1, which is hosted on Vercel and has less dependencies than the enterprise, e-commerce site I am working on. Well, it is fine and running great, so makes me think the issue is more complex possibly around the use of a custom server in the enterprise app and/or Next12 isn't playing nice with some node deps.

When I ran this in 12.0.0 and 12.0.1 the memory leak is non-existent:

Screen Shot 2022-05-03 at 3 30 33 PM

It is not until 12.0.2 that we experience the memory leak! and I've gone through your change logs and nothing looks suspicious.

To Reproduce

Unable to reproduce with custom server next example, so it's likely because we're using an enterprise app and something (ie other dependencies) isn't playing nice with next12 and would like to understand what could be causing that.

I can only assume something around SWC would have changed the build process so when it runs in node it gets wonky? Just speculating at this point because we can't upgrade to next12 at all.

But in our enterprise app, I simply do the following to reproduce consistently every time:

jflayhart commented 2 years ago

Curious if there are any known dependencies that Next requires to be updated for Next12+? What changed specifically in >=12.0.2 that could be causing growing Promises and array objects on a node server?

I could provide our package.json deps if that helps, but I am not exactly sure if it would be the fault of prod deps or devDeps due to the build process possibly causing issue.

jflayhart commented 2 years ago

Possibly related

As a temporary workaround you can disable esmExternals

What does this mean exactly? What is this disabling as a workaround? I'd prefer a real fix rather than a workaround :D What are the tradeoffs with disabling esmExternals?

jflayhart commented 2 years ago

@timneutkens sorry to directly ping you, but es-modules-support definitely seems to be the culprit and I don't understand why. Could you please shed some light onto what changed in Next12? Y'all said Next12 is "backwards compatible" but it seems that is not the case with enabling by default instead of opt-in:

ES Modules support includes backward compatibility to support the previous import behavior of CommonJS. In Next.js 12, esmExternals: true will become the default

That doesn't seem like a backwards compat change given the amount of mem leaks people are running into šŸ¤·šŸ» But maybe I am misunderstanding. Setting to false "works" so I can now run Next >12.0.1 (including latest), but is that the right thing to do? Again, what are we disabling exactly and why do we want to do this? I don't want to do this just cause "now it works" I'd like to understand any tradeoffs. Thanks!

timneutkens commented 2 years ago

given the amount of mem leaks people are running into

As far as we've seen on the other issues the webpack config was customized through next-transpile-modules which is not officially supported as it's a community customization of webpack config, could you confirm you're also using next-transpile-modules? If you have a reproduction that'd be great given that the other issues don't have a reproduction provided which makes it impossible to investigate further, so far my suspicion is that next-transpile-modules has a memory leak but we couldn't reproduce it in a new app so it might be related to specific combination of imports.

We're planning to build something like next-transpile-modules into Next.js to have an official supported way to do something like that: #35150

Y'all said Next12 is "backwards compatible" but it seems that is not the case with enabling by default instead of opt-in:

It is backwards compatible. It passes all integration tests and we test Next.js against the internal apps that Vercel has like the vercel.com app. These are among the largest apps we've seen to date. What you can't expect from customizing webpack config is that your customization of the build pipeline is backwards compatible between versions as you get full control over the webpack config.

What are the tradeoffs with disabling esmExternals?

Resolving of ESM packages in node_modules will be incorrect / not work. It won't resolve packages with "type": "module". More here: https://nextjs.org/blog/next-11-1#es-modules-support

jflayhart commented 2 years ago

@timneutkens awesome thanks so much for that explanation! I trust that y'all have the right tests in place and it's hard to know what enterprise, custom apps are doing with your framework. I get it.

We do NOT use next-transpile-modules

Considering that the esmExternals: false change seems to help most people out of this OOM issue, including me, should we add it to docs as a last resort "backwards compat" config change? For example, in Next12 upgrade guide:

"if you experience [this OOM error] first try setting esmExternals: false since it was turned on by default in Next12. Doing this will...."?

What do you think?

jflayhart commented 2 years ago

If you have a reproduction that'd be great given that the other issues don't have a reproduction provided which makes it impossible to investigate further, so far my suspicion is that next-transpile-modules has a memory leak but we couldn't reproduce it in a new app so it might be related to specific combination of imports.

I've spent so much time already, so I may as well work on a solid reproduction outside a large enterprise app šŸ˜… The problem is I don't know what in our npm spaghetti is causing the conflict with Next12, so hard to reproduce with only a handful of deps in an isolated sandbox. Will try my best...

timneutkens commented 2 years ago

@timneutkens awesome thanks so much for that explanation! I trust that y'all have the right tests in place and it's hard to know what enterprise, custom apps are doing with your framework. I get it.

Considering that the esmExternals: false change seems to help most people out of this OOM issue, should we add it to docs as a last resort "backwards compat" config change? For example, in Next12 upgrade guide:

"if you experience [this OOM error] first try setting esmExternals: false since it was turned on by default in Next12. Doing this will...."?

What do you think?

I'd prefer to fix the OOM even if it's not specifically in Next.js through a combination of deps. That flag will definitely go away in the future. Main problem so far is that there is no reproduction provided that would allow us to spend time on it as I'd be happy to have someone on our team look at it if one is provided.

I've spent so much time already, so I may as well work on a solid reproduction outside a large enterprise app šŸ˜… The problem is I don't know what in our npm spaghetti is causing the conflict with Next12, so hard to reproduce with only a handful of deps in an isolated sandbox. Will try my best...

That'd be great! šŸ™

jflayhart commented 2 years ago

@timneutkens I can start to reproduce what may eventually lead to a OOM server kill in my sandbox env, but it is so small of an app I can't get it to crash (yet). It makes sense why a large, enterprise site would crash more quickly due to its size and need for more resources.

That being said, the memory retained size trends are the same in heap snapshots between our proprietary enterprise site and my personal sandbox site, where obviously the latter is far less resource intensive. But I have reproduced the same memory leak!

To reproduce, I installed deps based on what I assumed to be the main offenders from preliminary R&D on the enterprise site:

  1. checkout https://github.com/jflayhart/website/tree/repro-oom
  2. yarn install
  3. yarn build
  4. yarn start
  5. Use Chrome and go to http://localhost:8080 and use chrome://inspect to open node DevTools
  6. Take heap snapshot
  7. refresh http://localhost:8080 a ton of times (eg for mac users, hold down cmd+r for about 30s)
  8. Take heap snapshot
  9. repeat steps 7-8 and even wait longer holding down the refresh hotkey to rapid refresh the site
  10. You will see some constructors grow over time; Promises almost exponentially

At its core, my sandboxed env is barebones version of our enterprise app, where the memory leak is mainly around Promises being held in the global namespace:

Screen Shot 2022-05-05 at 3 16 08 PM Screen Shot 2022-05-05 at 3 16 22 PM Screen Shot 2022-05-05 at 3 16 59 PM

You can imagine this is orders of magnitude worse for an enterprise app where, assuming the culprit is styled-components, there are thousands more lines of css-in-jss.

This is all a working theory, but it is fact that Promises and other constructors are retaining memory over time (see promises start growing up to the top in the last screenshot):

Screen Shot 2022-05-05 at 3 24 11 PM Screen Shot 2022-05-05 at 3 24 24 PM

Could this mean something is wrong with _document's initialProps usage with styled-component? My working theory is this is part of the issue

UPDATE: Setting esmExternals: false on sandbox project still shows growing retained memory size, but at a MUCH slower rate and after a while most of it gets GCd correctly:

Screen Shot 2022-05-06 at 12 49 30 PM
nol13 commented 2 years ago

not much additional info to provide, but we had to downgrade from 12.1.4 to 12.0.7 to fix a memory leak crash in the node process. can say that it was unrelated to which compiler we used, tried both. 12.0.7 not having the issue though

jchatrny commented 2 years ago

I've just tested and downgrading to 12.0.1 with babel fallback didn't solve issue for us. Issue described in #32314 still occurs (FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory).

For now we use dynamic imports for reducers as workaround.

jflayhart commented 2 years ago

not much additional info to provide, but we had to downgrade from 12.1.4 to 12.0.7 to fix a memory leak crash in the node process. can say that it was unrelated to which compiler we used, tried both. 12.0.7 not having the issue though

@nol13 could you please list out your deps as well? I'm trying to see if there's a pattern here, for example, do you have styled-components perhaps? Axios? react-query? What could be causing Node's Promises leak (eg. https://github.com/nodejs/node/issues/34328)?

nol13 commented 2 years ago

none of those

"next": "12.0.7", "next-connect": "0.10.1", "react": "17.0.2", "react-redux": "7.2.4", "request": "^2.88.2", "send": "^0.17.1", "async-retry": "1.2.3",

jflayhart commented 2 years ago

Hmm is redux the common thread here? @jchatrny what do you mean when you say "we use dynamic imports for reducers as workaround"?

hypeJunction commented 2 years ago

I have been fighting with heap memory issues for hours now, and nothing seemed to help. It started randomly this morning, and have been driving me crazy since. The only thing that helped is wiping .next directory and starting the server again. Maybe it helps someone with the knowledge of internals to figure out what's happening. Seems that webpack is trying to do something it shouldn't. I tried to inspect with node --inspect, but the traces were quite useless.

LuudJanssen commented 2 years ago

We are also experiencing this issue on our enterprise grade app, where we see exactly the same behavior as @jflayhart. When trying to reproduce it on a small project, the node process doesn't get killed, because it never reaches the critical memory point. However, for our large project, memory usage skyrockets to about 5GB before ending on heap errors.

Turning off esmExternals solved the issue.

It seems to be happening after the compilation has been done (the server logs that the compilation of the page has been done).

These are the dependencies we use:

  1. next 12.1.5
  2. typescript 4.6.3
  3. react 17.0.2
  4. @chakra-ui/react 1.8.8
  5. @chakra-ui/theme 1.14.1
  6. @chakra-ui/theme-tools 1.3.6
  7. next-i18next 11.0.0
  8. Joi 17.6.0
  9. dayjs 1.11.1
  10. lottie-react 2.2.1
  11. lottie-web 5.9.2
  12. @emotion/react 11.9.0
  13. @emotion/styled 11.8.1
  14. framer-motion 6.3.0
  15. wretch 1.7.9
  16. winston 3.7.2
  17. ... others
balazsorban44 commented 2 years ago

Hi, we recently landed some changes (https://github.com/vercel/next.js/pull/37397) to canary that might help fix this issue, without setting esmExternals: false.

Please try it out by installing next@canary and let us know!

It seemed to help a lot of people in this thread: https://github.com/vercel/next.js/issues/32314

LuudJanssen commented 2 years ago

@balazsorban44 seems to have fixed the issue! I'll check later for a 100% confirm, because I had to disable middleware because of breaking changes in the latest canary. If the memory leak was because of the middleware, it might not have been fixed.

balazsorban44 commented 2 years ago

Thanks for checking. I would say the memory issue is unlikely to be middleware related, but happy to get a 100% confirmation. šŸ‘

@jflayhart could you check canary as well?

glitteringkatie commented 2 years ago

I was lurking on here and want to confirm that switching to canary seems to have gotten rid of the memory leak in our enterprise-grade app as well! šŸŽ‰

jflayhart commented 2 years ago

@balazsorban44 I just pulled down v12.1.7-canary.42 and it's working great! Not seeing the same memory leaks šŸŽ‰ Certainly not around Promises. Great work!

Curious what was the exact issue in webpack (presuming since that was updated)?

Screen Shot 2022-06-21 at 12 06 25 PM
glitteringkatie commented 2 years ago

I was lurking on here and want to confirm that switching to canary seems to have gotten rid of the memory leak in our enterprise-grade app as well! šŸŽ‰

Follow up to my lurk, turns out the memory leak just changed šŸ˜ž We are using next in 2 repos: one with a lot of files and one with LONG files. Prior to the canary, long files had the memory leak and lots of files did not, whereas with the canary our repo with long files is now fine and our lots of files repo is getting OOM. I know this is a vague report but wanted to pass this insight along before we file an issue through vercel!

nol13 commented 1 year ago

This issues is (tentatively) fixed for us in 12.2.6-canary.8. Guessing it might be https://github.com/vercel/next.js/pull/39902 that did the trick

moshie commented 1 year ago

Confirmed here early reports are showing that 12.3.0 has fixed the leak our end šŸŽ‰

nol13 commented 1 year ago

Cool we might have to test out 12.3.0 then. Was worried we'd have to go with the canary since #39902 got reverted.

balazsorban44 commented 1 year ago

Based on multiple reports, going to close this for now. Please open a new issue (testing next@canary) with a new reproduction if you have similar issues.

github-actions[bot] commented 1 year ago

This closed issue has been automatically locked because it had no new activity for a month. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.