nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Fatal error in scavenger #36380

Open zzz08900 opened 3 years ago

zzz08900 commented 3 years ago

What steps will reproduce the bug?

We built a distributed computing framework for Node.js (https://github.com/dcfjs/dcf) on Node.js 8, and everything was fine. But when we try to upgrade to newer LTS versions of Node.js, the worker process (the node process that does the actual computation, which involves reconstructing objects and functions from strings) crashes under heavy computation load.

We tried a few versions from the 10.x, 12.x and 14.x lines, and all of them crash with error messages related to the scavenger.

We also tried a debug build of Node.js 14.15.1. The debug build always crashes with the same error message: # Fatal error in ../deps/v8/src/heap/scavenger-inl.h, line 376 # Debug check failed: Heap::InFromPage(object).

The complete error message is attached below.

How often does it reproduce? Is there a required condition?

It depends. On Node.js 14.15.1, a crash is more likely when running more workers than the host machine has CPU cores, usually within 10 minutes. It is less likely when running the worker processes with the parallel scavenger disabled and single-threaded GC (I can usually get 20 to 30 minutes under heavy computation load, but eventually one of the workers crashes).

What is the expected behavior?

The worker process does not crash.

What do you see instead?

Full error message from the Node.js 14.15.1 debug build:

#
# Fatal error in ../deps/v8/src/heap/scavenger-inl.h, line 376
# Debug check failed: Heap::InFromPage(object).
#
#
#
#FailureMessage Object: 0x7ffd75ba5580
 1: 0xee53d9 node::DumpBacktrace(_IO_FILE*) [/.../nodeSource/node-v14.15.1/node_g]
 2: 0x1076cd7  [/.../nodeSource/node-v14.15.1/node_g]
 3: 0x1076cf7  [/.../nodeSource/node-v14.15.1/node_g]
 4: 0x2a5207a V8_Fatal(char const*, int, char const*, ...) [/.../nodeSource/node-v14.15.1/node_g]
 5: 0x2a520a3  [/.../nodeSource/node-v14.15.1/node_g]
 6: 0x1600f54  [/.../nodeSource/node-v14.15.1/node_g]
 7: 0x1602e37 v8::internal::RootScavengeVisitor::VisitRootPointers(v8::internal::Root, char const*, v8::internal::FullObjectSlot, v8::internal::FullObjectSlot) [/.../nodeSource/node-v14.15.1/node_g]
 8: 0x1561bb1 v8::internal::Heap::IterateRoots(v8::internal::RootVisitor*, v8::base::EnumSet<v8::internal::SkipRoot, int>) [/.../nodeSource/node-v14.15.1/node_g]
 9: 0x1607af6 v8::internal::ScavengerCollector::CollectGarbage() [/.../nodeSource/node-v14.15.1/node_g]
10: 0x1562fd1 v8::internal::Heap::Scavenge() [/.../nodeSource/node-v14.15.1/node_g]
11: 0x1579fd8 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/.../nodeSource/node-v14.15.1/node_g]
12: 0x157a4df v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/.../nodeSource/node-v14.15.1/node_g]
13: 0x15fb26c v8::internal::ScavengeJob::Task::RunInternal() [/.../nodeSource/node-v14.15.1/node_g]
14: 0x13f5f66 non-virtual thunk to v8::internal::CancelableTask::Run() [/.../nodeSource/node-v14.15.1/node_g]
15: 0x10761a7 node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [/.../nodeSource/node-v14.15.1/node_g]
16: 0x10767f5 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/.../nodeSource/node-v14.15.1/node_g]
17: 0x10752a5 node::PerIsolatePlatformData::FlushTasks(uv_async_s*) [/.../nodeSource/node-v14.15.1/node_g]
18: 0x1e82c60  [/.../nodeSource/node-v14.15.1/node_g]
19: 0x1e9bf05  [/.../nodeSource/node-v14.15.1/node_g]
20: 0x1e83639 uv_run [/.../nodeSource/node-v14.15.1/node_g]
21: 0x101df1a node::NodeMainInstance::Run() [/.../nodeSource/node-v14.15.1/node_g]
22: 0xf4e88f node::Start(int, char**) [/.../nodeSource/node-v14.15.1/node_g]
23: 0x26babc0 main [/.../nodeSource/node-v14.15.1/node_g]
24: 0x7f8428797840 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
25: 0xe91389 _start [/.../nodeSource/node-v14.15.1/node_g]

Additional information

The test machine has an i7-7820HQ CPU and 32 GB of memory. The debug build was compiled with the default compilation flags.

zzz08900 commented 3 years ago

The same also happens with our dcfjs v2 (https://github.com/dcfjs/dcf2). We were not able to find an easy way to reproduce the problem; the crash is only observed under heavy computation load, e.g. 40 minutes or more on 4 CPU cores.

But if any additional information related to dcf/dcf2 is needed, please let me know and I'll provide as much info as I can.

Trott commented 3 years ago

Node.js 15.x has a newer V8 than 14.x etc. Any chance you could test with 15.x and see if the problem persists?

@nodejs/v8

dinfuehr commented 3 years ago

Hi, looks like this could be a missing write barrier. V8 has some command line flags that might help reproduce this bug: --stress-incremental-marking, --stress-scavenge and/or --verify-heap. Also try to reproduce this on the latest version of Node/V8; it could be that this is already fixed.
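As a rough sketch of how those flags could be passed to a worker process, the snippet below forks a child with `execArgv`. The helper names `stressExecArgv` and `startStressedWorker`, the worker script path, and the value given to `--stress-scavenge` are assumptions for illustration and do not come from dcf. Note that `--verify-heap` is generally only available in debug builds (or builds with heap verification enabled).

```javascript
const { fork } = require('child_process');

// Build the V8 flags suggested above. --stress-scavenge takes an integer
// (a percentage of new-space capacity); 50 is an arbitrary choice.
function stressExecArgv({ verifyHeap = false, scavengePercent = 50 } = {}) {
  const flags = [
    '--stress-incremental-marking',
    `--stress-scavenge=${scavengePercent}`,
  ];
  // --verify-heap usually needs a debug build such as node_g.
  if (verifyHeap) flags.push('--verify-heap');
  return flags;
}

// Hypothetical launcher: workerPath is whatever script starts a worker.
function startStressedWorker(workerPath, options) {
  return fork(workerPath, [], { execArgv: stressExecArgv(options) });
}
```

Passing the flags through `execArgv` keeps them scoped to the worker processes, so the master process runs with its normal GC behavior.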

zzz08900 commented 3 years ago

> Node.js 15.x has a newer V8 than 14.x etc. Any chance you could test with 15.x and see if the problem persists?
>
> @nodejs/v8

Thanks for the heads up, I'll be trying it out.

zzz08900 commented 3 years ago

Just ran a quick stress test (30-some minutes) with Node.js 15.4.0. No crash was observed :) I'll set up a more serious torture test and see if everything really holds together.

zzz08900 commented 3 years ago

Nope, the problem persists. But in most cases Node.js 15.4.0 fails outright with signal 11 (exit code 139). I'm building a debug build of Node.js 15.4.0 now and will post the error message later.
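For reference, exit code 139 follows the shell convention of 128 plus the signal number, so it corresponds to signal 11 (SIGSEGV on Linux). A minimal sketch of decoding this in a supervisor process is shown below; `signalFromExitCode` and `describeExit` are hypothetical helper names, not part of Node.js or dcf.

```javascript
const os = require('os');

// Map an exit status above 128 back to a signal name, following the
// shell convention status = 128 + signal number.
function signalFromExitCode(code) {
  if (!Number.isInteger(code) || code <= 128) return null;
  const num = code - 128;
  const entry = Object.entries(os.constants.signals).find(([, n]) => n === num);
  return entry ? entry[0] : null;
}

// Hypothetical handler body for a child_process 'exit' event, where Node
// passes a numeric code for a normal exit or a signal name if the child
// was terminated by a signal.
function describeExit(code, signal) {
  const name = signal || signalFromExitCode(code);
  return name ? `killed by ${name}` : `exited with code ${code}`;
}
```

A supervisor could call `describeExit(code, signal)` from `child.on('exit', ...)` to log whether a worker segfaulted or exited normally.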

dinfuehr commented 3 years ago

Hi, did you try to add some of the flags mentioned above to reproduce the crash faster? As soon as we have a reliable and fast way to reproduce the bug locally, I can try to start investigating it.

zzz08900 commented 3 years ago

> Hi, did you try to add some of the flags mentioned above to reproduce the crash faster? As soon as we have a reliable and fast way to reproduce the bug locally, I can try to start investigating it.

Thanks in advance, I'll experiment with them later. I was hoping to get the whole thing sorted out with an upgrade to Node.js 15 :(

zzz08900 commented 3 years ago

Hi, we just found that triggering GC manually every once in a while (almost) fixes the crash on Node.js 14.15.1. Is there any way of knowing which line of JS code was being executed just before Node.js crashed?
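A minimal sketch of what such a manual GC trigger might look like, assuming the worker is started with `--expose-gc` (without it, `global.gc` is undefined). The helper name `startPeriodicGc` and the 30-second interval are illustrative assumptions, not the values used in dcf.

```javascript
// Start a timer that forces a garbage collection every intervalMs
// milliseconds. Returns null when GC is not exposed, i.e. when node
// was started without --expose-gc.
function startPeriodicGc(intervalMs = 30000) {
  if (typeof global.gc !== 'function') return null;
  const timer = setInterval(() => global.gc(), intervalMs);
  timer.unref(); // do not keep the process alive just for this timer
  return timer;
}
```

This only reduces how often the crash shows up; it does not address the underlying heap problem.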

Trott commented 3 years ago

> Hi, we just found that triggering GC manually every once in a while (almost) fixes the crash on Node.js 14.15.1. Is there any way of knowing which line of JS code was being executed just before Node.js crashed?

/ping @nodejs/diagnostics

joyeecheung commented 3 years ago

@zzz08900 You can try setting up your system so that a core dump is produced when the process crashes, and use https://github.com/nodejs/llnode to load the core dump along with the Node.js executable to get the JS stack trace at the time of the crash.

nmuthusamy commented 6 months ago

@zzz08900 I understand this thread has not been active for a while. I hope you have tried using the latest node version, which might have addressed the issue.