medusajs / medusa

Building blocks for digital commerce
https://medusajs.com
MIT License

Server crashes due to OOM, possible memory leak. #1651

Closed - olavurdj closed this issue 1 year ago

olavurdj commented 2 years ago

The Medusa server instance crashes (SIGABRT) due to a memory allocation failure after a period of time (1-4 hours observed). This happens both when running with medusa start and medusa develop.

Medusa version 1.3.1 with Postgres. Node v16.15.1 on macOS.

<--- Last few GCs --->

[77923:0x7fb5ab900000]  3441351 ms: Mark-sweep 4045.9 (4136.8) -> 4036.3 (4141.0) MB, 4754.0 / 0.0 ms  (average mu = 0.781, current mu = 0.394) task scavenge might not succeed
[77923:0x7fb5ab900000]  3450246 ms: Mark-sweep 4049.5 (4141.3) -> 4040.8 (4145.8) MB, 5770.7 / 0.1 ms  (average mu = 0.653, current mu = 0.351) task scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0x104e89a85 node::Abort() [/usr/local/bin/node]
 2: 0x104e89c08 node::OnFatalError(char const*, char const*) [/usr/local/bin/node]
 3: 0x105002a67 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 4: 0x105002a03 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 5: 0x1051a1445 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/bin/node]
 6: 0x10519fcac v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
 7: 0x10522e37d v8::internal::ScavengeJob::Task::RunInternal() [/usr/local/bin/node]
 8: 0x104ef4bcb node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) [/usr/local/bin/node]
 9: 0x104ef3617 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/usr/local/bin/node]
10: 0x10583d52b uv__async_io [/usr/local/bin/node]
11: 0x105850c9b uv__io_poll [/usr/local/bin/node]
12: 0x10583da21 uv_run [/usr/local/bin/node]
13: 0x104dc2eaf node::SpinEventLoop(node::Environment*) [/usr/local/bin/node]
14: 0x104ec9f41 node::NodeMainInstance::Run(int*, node::Environment*) [/usr/local/bin/node]
15: 0x104ec9b99 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [/usr/local/bin/node]
16: 0x104e5768b node::Start(int, char**) [/usr/local/bin/node]
17: 0x7fff2039df3d start [/usr/lib/system/libdyld.dylib]

srindom commented 2 years ago

Thanks for reporting - does this happen after calling a specific endpoint or does it reach the OOM simply after running for a while?

rhijjawi commented 2 years ago

This happens after running medusa deploy, after an unspecified amount of time (3-5 hours). My case is on a Digital Ocean droplet with 2 GB of RAM.

rhijjawi commented 2 years ago

<--- Last few GCs --->

[58606:0x4e5d040] 16056726 ms: Mark-sweep 1964.2 (2000.5) -> 1963.5 (2000.2) MB, 4683.1 / 0.2 ms  (average mu = 0.780, current mu = 0.391) allocation failure scavenge might not succeed
[58606:0x4e5d040] 16064831 ms: Mark-sweep 1964.6 (2000.2) -> 1963.8 (2000.7) MB, 5098.1 / 0.1 ms  (average mu = 0.660, current mu = 0.371) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb00e10 node::Abort() [node]
 2: 0xa1823b node::FatalError(char const*, char const*) [node]
 3: 0xcee09e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcee417 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xea65d5  [node]
 6: 0xeb5cad v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xeb89ae v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 8: 0xe79d57 v8::internal::Factory::New(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationType) [node]
 9: 0xe89c91 v8::internal::Factory::NewProperSubString(v8::internal::Handle<v8::internal::String>, int, int) [node]
10: 0x10ead06 v8::internal::SharedFunctionInfo::GetSourceCodeHarmony(v8::internal::Handle<v8::internal::SharedFunctionInfo>) [node]
11: 0x1063880 v8::internal::JSFunction::ToString(v8::internal::Handle<v8::internal::JSFunction>) [node]
12: 0xd682aa v8::internal::Builtin_FunctionPrototypeToString(int, unsigned long*, v8::internal::Isolate*) [node]
13: 0x15e7dd9  [node]

olavurdj commented 2 years ago

@srindom I haven't observed any specific API calls that trigger this behaviour; it seems to be mostly work-over-time related. I'll try to see if I can narrow it down. Do you have any overview of Node versions used in production setups?

octalpixel commented 2 years ago

@srindom I haven't observed any specific API calls that trigger this behaviour; it seems to be mostly work-over-time related. I'll try to see if I can narrow it down. Do you have any overview of Node versions used in production setups?

I feel I can second this opinion. I've noticed the same behaviour locally when the server has been handling requests for an hour or more.

srindom commented 2 years ago

Most of the production setups we have been involved in run on Node 14 or 16.

Do you know what the max-old-space-size is set to? I believe that Node defaults to 512MB in some circumstances which might not be enough for Medusa to run in a production setting (although I don't have any hard stats to back this up).

As per the Node docs it is recommended to set this to ~1.5GB for a machine with 2GB RAM: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-megabytes
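
For anyone unsure what limit their process is actually running with, here is a small diagnostic sketch (plain Node, not Medusa-specific) that reads the effective heap limit from the v8 module:

// check-heap.js - prints the V8 heap limit the current process was started with
const v8 = require("v8");

const limitMb = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap limit: ~${Math.round(limitMb)} MB`);

// Run it with and without the flag to confirm NODE_OPTIONS is being picked up, e.g.
// NODE_OPTIONS=--max-old-space-size=4096 node check-heap.js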

rhijjawi commented 2 years ago

How would we do this for Medusa?

olavurdj commented 2 years ago

@srindom I'm running it with a max-old-space-size of ~4GB.

@rhijjawi You can set the NODE_OPTIONS environment variable, either globally or when running the server. A simple approach is to just edit the script in your package.json, e.g.

"scripts": {
    "serve": "cross-env NODE_OPTIONS=--max-old-space-size=4096 medusa start",
    ...
}

Note: Make sure to add cross-env to your dev-dependencies.

AlexDigital974 commented 2 years ago

@olavurdj

I have 100% the exact same error as you. When I start Medusa, memory starts at around 380 MB and I can see it increase bit by bit, even if I do nothing, until it crashes.

This is my Medusa API core before crashing: 3.29 GB. When I launch it, it's just 380 MB. I'm on localhost, doing nothing, chilling, not even launching the admin or storefront. (screenshot of memory usage attached)

After a lot of testing, I found out it's because as soon as you start to have a lot of data (not sure how much) in your Medusa database (customers, products, prices, etc.), you absolutely MUST install/use Redis. It is not optional. I didn't, because I was on localhost and had just started to play with Medusa, and while I had little or no data I didn't have any errors.
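
For reference, using a real Redis is just a matter of setting redis_url in medusa-config.js; without it, Medusa falls back to the in-memory fakeredis discussed here. A minimal sketch, assuming the standard v1 projectConfig layout and a REDIS_URL environment variable:

// medusa-config.js (sketch - env var names and values are placeholders)
module.exports = {
  projectConfig: {
    redis_url: process.env.REDIS_URL, // e.g. redis://localhost:6379
    database_url: process.env.DATABASE_URL,
    database_type: "postgres",
    store_cors: process.env.STORE_CORS,
    admin_cors: process.env.ADMIN_CORS,
  },
  plugins: [],
};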

Maybe I should raise a ticket, but my problem is repeatable; I tested it with 2 friends and got the same memory leak error.

srindom commented 2 years ago

@AlexDigital974 this is super insightful - my immediate thought is that it has something to do with fakeredis not being configured correctly.

Will try to play around with this - thank you!

octalpixel commented 2 years ago

Quick thing to add as well: I believe Redis gets pinged most of the time, as I ran out of the 10k requests from Upstash in less than 30-45 minutes, and that's also on my development machine.

olavurdj commented 2 years ago

@AlexDigital974 Thank you! I am not running redis, so this is very likely the same issue I'm experiencing. I'll see if installing redis on my setup helps.

AlexDigital974 commented 2 years ago

@olavurdj Great! It is likely the same issue as I have: if you do not have Redis and your DB is not empty, this is it.

If it resolves your issue, let us know. I think as more people use Medusa, having a DB with some data and no Redis installed is a very likely scenario that a lot of people will run into.

rhijjawi commented 1 year ago

I can confirm that fakeredis is enabled on my instance of medusa.


magnusdr commented 1 year ago

I just recently disconnected a Redis instance from our MedusaJS server in order to save hosting expenses, as we currently do not depend on any of the features Redis enables. This caused OOM crashes for us as well.

Would be nice to have this resolved, as it can be quite cost saving, especially in a staging/preview environment.

haveamission commented 1 year ago

Is there a solution for this yet, other than "install redis"?

rhijjawi commented 1 year ago

fake-redis is RAM-based, right? So why not just install Redis, even if you don't otherwise use it?

AlexDigital974 commented 1 year ago

@haveamission Just install Redis... wherever your server is (localhost, on-premise, or cloud), just install Redis. It's FREE, and I kid you not, you can have it up in like a minute, rather than waiting for this to be fixed. And even once it is fixed, fakeredis is not supposed to replace a proper Redis.

Just use Docker; you don't even need to pay or do any setup.

Put the below into a docker-compose.yml, run it via Docker Compose, problem solved. And you can reuse that file for your staging/testing/developer teams, easy.

version: "3.8"
services:
  redis-server:
    image: "redis:alpine"
    command: redis-server
    ports:
      - "6379:6379"
    volumes: 
      - redis-data:/data

volumes:
  redis-data:
    driver: local
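
With that container running (docker compose up -d), you just point Medusa at it by setting redis_url to redis://localhost:6379 in medusa-config.js, as in the sketch earlier in the thread.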

magnusdr commented 1 year ago

@AlexDigital974 running Redis locally is no problem.

However, if you want to deploy PR builds in your review process, a Redis database on Digital Ocean would add $15 to your bill. Being able to run a fully functional MedusaJS server on low resources makes automated testing and reviewing easier.

I'm not saying it's a deal breaker, but if this could be prevented, it would be much appreciated 🙌

rhijjawi commented 1 year ago

@magnusdr you make a fair point, having a fix for 'fakeredis' would be optimal.

Does anyone know if there's a way to restart the 'fakeredis' instance via shell? If so, could we cron it every x hours?

Also, a redis instance on the same machine as the Medusa instance is negligible in size, right? We don't NEED a whole other VM.

olivermrbl commented 1 year ago

We will soon deprecate the fake Redis module and default to an event bus that has no effect when events are emitted and subscribed to. It will just let the events pass through, so that you can deploy without Redis (even though that's not recommended) or boot up preview environments, as mentioned by @magnusdr.

You can find the WIP implementation in this PR.

Right now, we use ioredis-mock (link) to power the fake Redis, so you might be able to find a command in their documentation to restart or flush the instance.

Hope this helps.
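
To illustrate the last point, here is a minimal standalone sketch (assuming plain ioredis-mock usage, not Medusa's internal instance): ioredis-mock exposes the same command surface as ioredis, so flushall() clears its data. But that data lives inside the Node process itself, so there is nothing to restart "via shell"; it simply resets when the process restarts.

// fakeredis-flush.js - standalone ioredis-mock sketch
const Redis = require("ioredis-mock");

async function main() {
  const redis = new Redis();
  await redis.set("bull:EventBusService:queue:id", "42");
  console.log(await redis.keys("*")); // [ 'bull:EventBusService:queue:id' ]
  await redis.flushall(); // clears the mock's in-memory store
  console.log(await redis.keys("*")); // []
}

main();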

olivermrbl commented 1 year ago

I will close this issue, as we have identified the culprit and will introduce a fix soon.

chrisp-code commented 1 year ago

I'm encountering the opposite issue. I have 102 products, 7,960 variants, 816 product options, and 63,680 product option values, plus money amounts to support all the possible destinations and currencies (~135 per variant).

If I have redis enabled, about 2 minutes after running npm run dev, with no requests, memory starts to go up until it hits the OOM error. If I disable redis, it stays consistently at 296MB.

Edit: it eventually fails with redis disabled too.

olivermrbl commented 1 year ago

@chrisp-code What version of @medusajs/medusa do you have installed?

olivermrbl commented 1 year ago

If you are using v1.7.10 or above, you might face an issue with Bull (our message queue, powered by Redis) that was uncovered today.

If this is the case, give this version of the core a spin: yarn add @medusajs/medusa@1.7.12-snapshot-20230306155104

Should resolve your issues. Let me know how it goes :)

chrisp-code commented 1 year ago

@chrisp-code What version of @medusajs/medusa do you have installed?

Was on 1.7.8 and then upgraded to the latest (1.7.11), and both presented the behavior.

Truncated my DB so I could start over. Trying to make some modifications so I can reduce the number of variants and options.

olivermrbl commented 1 year ago

Got it, would love if you could keep me posted :)

chrisp-code commented 1 year ago

Was able to reduce the number of variants by 60% (this is still only 2% of my current number of products, so this is not ideal), but I still get the OOM error after restarting the service. It was stable before restarting, but had 7,810 "bull:EventBusService:queue:job_{id}" keys in Redis with 1.7.11; will try the snapshot next.

Both 1.7.11 and 1.7.12-snapshot-20230306155104 have the same behavior of OOM after ~1-2 minutes of idle. No immediate issue with redis disabled but still eventually OOMs.

I ran redis-cli flushall and was looking at the output of redis-cli keys *. I don't know if this means anything to you, but I did notice this difference in output.

Before OOM (after server start and at idle):

1) "bull:EventBusService:queue:stalled-check"
2) "bull:EventBusService:queue:1:lock"
3) "bull:EventBusService:queue:id"
4) "bull:EventBusService:queue:1"
5) "bull:scheduled-jobs:queue:stalled-check"
6) "bull:EventBusService:queue:active"

After (line 1 is sometimes present or not):

1) "bull:EventBusService:queue:1:lock"
2) "bull:EventBusService:queue:id"
3) "bull:EventBusService:queue:1"
4) "bull:EventBusService:queue:active"

chrisp-code commented 1 year ago

Truncated my DB again and re-ran the product loading with 1.7.12-snapshot-20230306155104, no change in behavior. Still get OOM after ~1 min at idle after restart.

Didn't see the 7K "bull:EventBusService:queue:job_{id}" keys with this version though.

redis-cli keys * before restarting:

1) "bull:EventBusService:queue:id"
2) "bull:EventBusService:queue:3322"
3) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
4) "bull:EventBusService:queue:active"

During memory increase:

1) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
2) "bull:scheduled-jobs:queue:stalled-check"
3) "bull:EventBusService:queue:stalled"
4) "bull:EventBusService:queue:active"
5) "bull:EventBusService:queue:stalled-check"
6) "bull:EventBusService:queue:id"
7) "bull:EventBusService:queue:3322"
8) "bull:EventBusService:queue:3323:lock"
9) "bull:EventBusService:queue:3323"

After crash:

1) "bull:EventBusService:queue:wait"
2) "sess:f-ioRYmWH62ECcPoZBROw39315jgCWGw"
3) "bull:EventBusService:queue:active"
4) "bull:EventBusService:queue:id"
5) "bull:EventBusService:queue:3322"
6) "bull:EventBusService:queue:3323"

chrisp-code commented 1 year ago

Cleared DB, removed medusa-plugin-meilisearch, and reloaded products again - the issue did not happen again.

Seems like loading a bunch of products & variants makes Meilisearch fall behind on processing, or old subscriber events aren't being cleared?

chrisp-code commented 1 year ago

For anyone experiencing the issue I listed above, I found out that reducing the batch_size from the default (1000) to something lower fixed the issue with meilisearch (you can set it on the options object when configuring the plugin).

This only appears to happen if you have products with a lot of variants, options, and prices. I ended up setting batch_size to 10 and it still hits a peak of 1.2 GB memory while processing.
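
For anyone else hitting this, a sketch of what that could look like in medusa-config.js, assuming batch_size sits directly on the plugin's options object as described above (host, apiKey, and the index settings are the usual placeholders from the plugin docs):

// medusa-config.js (sketch) - plugin entry for the plugins array
const plugins = [
  // ...other plugins
  {
    resolve: "medusa-plugin-meilisearch",
    options: {
      batch_size: 100, // assumed placement per the comment above; default is 1000
      config: {
        host: process.env.MEILISEARCH_HOST,
        apiKey: process.env.MEILISEARCH_API_KEY,
      },
      settings: {
        products: {
          indexSettings: {
            searchableAttributes: ["title", "description", "variant_sku"],
            displayedAttributes: ["title", "description", "variant_sku", "thumbnail", "handle"],
          },
        },
      },
    },
  },
];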

GianniVullo commented 10 months ago

Hi,

Context: I was confronted with this error while trying to deploy the backend along with the admin plugin. First I deployed with the Digital Ocean App Platform; then, assuming Digital Ocean was the problem, I tried to deploy to Railway, but again faced the exact same error code as the OP.

Solution

  {
    resolve: "@medusajs/admin",
    /** @type {import('@medusajs/admin').PluginOptions} */
    options: {
      autoRebuild: false, // Here is the fix
      serve: true // I think this is the default, so it might be useless
    },
  },

The way I understand it is that with autoRebuild set to true, the cloud providers were trying to launch and rebuild the admin at the same time, causing memory to be saturated. But this is only a humble guess.

It seems to me that this piece of information should be in all of the MedusaJS "deploy" documentation sections, as it would very easily avoid frustration, and even quitting, from some of your users who, like me, are less experienced.

Thank you for your project, I like medusa JS.