kube-HPC / hkube

🐟 High Performance Computing over Kubernetes - Core Repo 🎣
http://hkube.io
MIT License

Pipeline driver queue recovery fails when the queue is large #1429

Closed tamir321 closed 1 year ago

tamir321 commented 2 years ago

HKube micro-service: Pipeline driver queue

Describe the bug: Pipeline driver queue recovery fails when the queue is large. To reproduce, send 3000 pipeline execution jobs to HKube and, once the jobs start to run, delete the pipeline driver queue pod.

The new pod starts to receive the jobs and then crashes with a Node.js out-of-memory error:

{"meta":{"type":"pipeline-driver-queue","hostName":"pipeline-driver-queue-74b9c64dbd-gvvr9","uptime":7097,"internal":{"component":"QUEUE"},"trace":null,"timestamp":1635143930934},"level":"info","message":"new job inserted to queue, queue size: 1981"}

<--- Last few GCs --->

[1:0x552c930] 63899 ms: Mark-sweep 2013.8 (2077.9) -> 1999.1 (2079.9) MB, 2633.7 / 0.0 ms (average mu = 0.085, current mu = 0.019) allocation failure scavenge might not succeed
[1:0x552c930] 66575 ms: Mark-sweep 2017.0 (2081.4) -> 2002.0 (2082.4) MB, 2622.7 / 0.0 ms (average mu = 0.053, current mu = 0.020) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: 0xa3ac30 node::Abort() [node]
2: 0x98a45d node::FatalError(char const, char const) [node]
3: 0xbae25e v8::Utils::ReportOOMFailure(v8::internal::Isolate, char const, bool) [node]
4: 0xbae5d7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate, char const, bool) [node]
5: 0xd56125 [node]
6: 0xd56acb v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node]
7: 0xd6481c v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
8: 0xd65684 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
9: 0xd680fc v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
10: 0xd2f51a v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11: 0x107d660 v8::internal::Runtime_AllocateInOldGeneration(int, unsigned long, v8::internal::Isolate) [node]
12: 0x13f5079 [node]
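
For anyone triaging this, the two Mark-sweep entries in the log above can be read roughly as follows. This is a minimal sketch, not part of HKube: the regular expression and the `summarize` helper are hypothetical names written only to pull the numbers out of the two lines quoted in this report. It assumes the usual V8 trace layout (heap size before and after the collection in MB, followed by the pause in ms).

```python
import re

# The two Mark-sweep entries copied verbatim from the crash log in this issue.
GC_LOG = """\
[1:0x552c930] 63899 ms: Mark-sweep 2013.8 (2077.9) -> 1999.1 (2079.9) MB, 2633.7 / 0.0 ms (average mu = 0.085, current mu = 0.019) allocation failure scavenge might not succeed
[1:0x552c930] 66575 ms: Mark-sweep 2017.0 (2081.4) -> 2002.0 (2082.4) MB, 2622.7 / 0.0 ms (average mu = 0.053, current mu = 0.020) allocation failure scavenge might not succeed
"""

# Hypothetical pattern: timestamp, used heap before/after (MB), pause (ms).
PATTERN = re.compile(
    r"(?P<at>\d+) ms: Mark-sweep (?P<before>[\d.]+) \([\d.]+\) -> "
    r"(?P<after>[\d.]+) \([\d.]+\) MB, (?P<pause>[\d.]+) /"
)


def summarize(log):
    """Return one dict per Mark-sweep entry: when it ran, MB freed, pause in ms."""
    rows = []
    for match in PATTERN.finditer(log):
        before = float(match["before"])
        after = float(match["after"])
        rows.append(
            {
                "at_ms": int(match["at"]),
                "freed_mb": round(before - after, 1),
                "pause_ms": float(match["pause"]),
            }
        )
    return rows


if __name__ == "__main__":
    for row in summarize(GC_LOG):
        print(row)
```

Read this way, each full collection takes about 2.6 seconds and frees only about 15 MB out of roughly 2 GB, and the two collections start about 2.7 seconds apart, so the process appears to spend almost all of its time in garbage collection. That is consistent with the "Ineffective mark-compacts near heap limit" message: the heap is full of live objects rather than garbage.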

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.