tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

memory leak (core dumped) problem in tfjs-node #8312

Open Hyodori04 opened 1 week ago

Hyodori04 commented 1 week ago

System information

Describe the current behavior

We run our service in a Node.js Docker container. When several sequential requests call model.predict, our Node server is killed. I think there is some kind of memory leak (or memory corruption), because the error logs look like the following, and the Docker memory metrics show a similar memory size each time it happens:

  1. free(): invalid size / Aborted (core dumped)
  2. segmentation fault (core dumped)
  3. corrupted size vs. prev_size (core dumped)

Describe the expected behavior

The memory-leak crash does not happen.

Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/CodePen/any notebook.

interface GetImageAnimalInfo {
  model: tf.GraphModel;
  arrayBuffer: ArrayBuffer;
  mutex: Mutex;
}

export const getImageAnimalInfo = async ({
  model,
  arrayBuffer,
  mutex,
}: GetImageAnimalInfo) => {
  // Serialize predictions so only one request runs inference at a time.
  const release = await mutex.acquire();

  try {
    // tf.tidy disposes every intermediate tensor created in the callback.
    const prediction = tf.tidy(() => {
      const uint8Array = new Uint8Array(arrayBuffer);

      const inputTensor = tf.node
        .decodeImage(uint8Array, IMAGE_DECODE_CHANNEL)
        .resizeNearestNeighbor([RESIZE_DIMENSION, RESIZE_DIMENSION])
        .toFloat();

      // Normalize and add a batch dimension.
      const input = inputTensor.div(tf.scalar(SCALAR)).expandDims(0);

      return model.predict(input) as tf.Tensor;
    });

    // Download the result, then free the output tensor explicitly
    // (tf.tidy does not dispose the tensor it returns).
    const result = await prediction.data();
    prediction.dispose();

    return result;
  } catch (err) {
    logger.error(err);
    return [];
  } finally {
    release();
  }
};

Other info / logs

lldb trace

(lldb) bt
* thread #1, name = 'next-server (v', stop reason = signal SIGABRT
  * frame #0: 0x00007f1d670c5e2c libc.so.6`__pthread_kill_implementation(threadid=<unavailable>, signo=6, no_tid=<unavailable>) at pthread_kill.c:44:76
    frame #1: 0x00007f1d67076fb2 libc.so.6`__GI_raise(sig=6) at raise.c:26:13
    frame #2: 0x00007f1d67061472 libc.so.6`__GI_abort at abort.c:79:7
    frame #3: 0x00007f1d670ba430 libc.so.6`__libc_message(action=do_abort, fmt="") at libc_fatal.c:155:5
    frame #4: 0x00007f1d670cf7aa libc.so.6`malloc_printerr(str=<unavailable>) at malloc.c:5660:3
    frame #5: 0x00007f1d670d18a8 libc.so.6`_int_free(av=0x00007f1d6720dc60, p=0x0000000009c7dbb0, have_lock=0) at malloc.c:4602:9
    frame #6: 0x00007f1d670d3e8f libc.so.6`__GI___libc_free(mem=<unavailable>) at malloc.c:3385:7
    frame #7: 0x00007f1811cfa269 libtensorflow.so.2`dnnl::impl::cpu::x64::avx512_common_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*) + 1561
    frame #8: 0x00007f1811cfa9ca libtensorflow.so.2`dnnl::impl::cpu::x64::jit_avx512_common_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) + 1754
    frame #9: 0x00007f1811e8dd15 libtensorflow.so.2`dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) + 4757
    frame #10: 0x00007f18119eae2c libtensorflow.so.2`dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) + 300
    frame #11: 0x00007f181163c473 libtensorflow.so.2`dnnl_sgemm + 99
    frame #12: 0x00007f180e346161 libtensorflow.so.2`Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 8, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 8, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) + 145
    frame #13: 0x00007f180e350095 libtensorflow.so.2`Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) + 549
    frame #14: 0x00007f180e351f4c libtensorflow.so.2`Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) + 1036
    frame #15: 0x00007f180e352591 libtensorflow.so.2`Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) + 385
    frame #16: 0x00007f180e352558 libtensorflow.so.2`Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) + 328
    frame #17: 0x00007f180e36afdb libtensorflow.so.2`void Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::evalProductImpl<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, 0>(float*, Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback) const + 5643
    frame #18: 0x00007f180e36bf01 libtensorflow.so.2`Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const> const, Eigen::ThreadPoolDevice, true, (Eigen::internal::TiledEvaluation)0>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const> const&, Eigen::ThreadPoolDevice const&) + 657
    frame #19: 0x00007f1810b54bc0 libtensorflow.so.2`tensorflow::(anonymous namespace)::LaunchGeneric<Eigen::ThreadPoolDevice, float>::operator()(tensorflow::OpKernelContext*, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long, std::allocator<long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) (.isra.0.constprop.0) + 1712
    frame #20: 0x00007f1810b54e01 libtensorflow.so.2`tensorflow::LaunchConv2DOp<Eigen::ThreadPoolDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long, std::allocator<long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) + 561
    frame #21: 0x00007f1810b55d3f libtensorflow.so.2`tensorflow::Conv2DOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) + 431
    frame #22: 0x00007f1806d7ffbb libtensorflow_framework.so.2`tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) + 75
    frame #23: 0x00007f180cce9187 libtensorflow.so.2`tensorflow::KernelAndDeviceOp::Run(tensorflow::ScopedStepContainer*, tensorflow::EagerKernelArgs const&, std::vector<absl::lts_20211102::variant<tensorflow::Tensor, tensorflow::TensorShape>, std::allocator<absl::lts_20211102::variant<tensorflow::Tensor, tensorflow::TensorShape> > >*, tensorflow::CancellationManager*, absl::lts_20211102::optional<tensorflow::EagerFunctionParams> const&, absl::lts_20211102::optional<tensorflow::ManagedStackTrace> const&, tensorflow::CoordinationServiceAgent*) + 2503
    frame #24: 0x00007f180cc997c9 libtensorflow.so.2`tensorflow::EagerKernelExecute(tensorflow::EagerContext*, absl::lts_20211102::InlinedVector<tensorflow::TensorHandle*, 4ul, std::allocator<tensorflow::TensorHandle*> > const&, absl::lts_20211102::optional<tensorflow::EagerFunctionParams> const&, std::unique_ptr<tensorflow::KernelAndDevice, tensorflow::core::RefCountDeleter> const&, tensorflow::GraphCollector*, tensorflow::CancellationManager*, absl::lts_20211102::Span<tensorflow::TensorHandle*>, absl::lts_20211102::optional<tensorflow::ManagedStackTrace> const&) + 649
    frame #25: 0x00007f180cc9ab89 libtensorflow.so.2`tensorflow::ExecuteNode::Run() + 457
    frame #26: 0x00007f180cce0d50 libtensorflow.so.2`tensorflow::EagerExecutor::SyncExecute(tensorflow::EagerNode*) + 1040
    frame #27: 0x00007f180cc94c46 libtensorflow.so.2`tensorflow::(anonymous namespace)::EagerLocalExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) + 5686
    frame #28: 0x00007f180cc952b4 libtensorflow.so.2`tensorflow::EagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) + 596
    frame #29: 0x00007f180b076ae0 libtensorflow.so.2`tensorflow::EagerOperation::Execute(absl::lts_20211102::Span<tensorflow::AbstractTensorHandle*>, int*) + 512
    frame #30: 0x00007f180ccf544a libtensorflow.so.2`tensorflow::CustomDeviceOpHandler::Execute(tensorflow::ImmediateExecutionOperation*, tensorflow::ImmediateExecutionTensorHandle**, int*) + 1498
    frame #31: 0x00007f1808d4f9f6 libtensorflow.so.2`TFE_Execute + 102
    frame #32: 0x00007f1d6477ffcb tfjs_binding.node`tfnodejs::TFJSBackend::ExecuteOp(napi_env__*, napi_value__*, napi_value__*, napi_value__*, napi_value__*) + 1867
    frame #33: 0x00007f1d647835f0 tfjs_binding.node`tfnodejs::ExecuteOp(napi_env__*, napi_callback_info__*) + 672
    frame #34: 0x0000000000c51049 node`v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) + 137
    frame #35: 0x000000000281d600 node`vtable for v8impl::(anonymous namespace)::FunctionCallbackWrapper + 16
    frame #36: 0x0000000000f0c81c node`v8::internal::InvokeFunctionCallback(v8::FunctionCallbackInfo<v8::Value> const&, void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) + 204
    frame #37: 0x00000000018d759d node`Builtins_CallApiCallback + 221
    frame #38: 0x00007f1d3f891ffc
    frame #39: 0x00007f1d3f8138d0
    frame #40: 0x00007f1d3f810e07
    frame #41: 0x00007f1d3f97161f
    frame #42: 0x00007f1d3f807a67
    frame #43: 0x00007f1d3f78eacf
    frame #44: 0x00007f1d3f803aac
    frame #45: 0x00007f1d3f804f41
    frame #46: 0x00007f1d3f815297
    frame #47: 0x00007f1d3f80d45f
    frame #48: 0x00007f1d3f80d6f5
    frame #49: 0x00007f1d3f80b7fc
    frame #50: 0x00007f1d3f78eacf
    frame #51: 0x00007f1d3f80c7e0
    frame #52: 0x00007f1d3f81119a
    frame #53: 0x00007f1d3f807a67
    frame #54: 0x00007f1d3f78eacf
    frame #55: 0x00007f1d3f803aac
    frame #56: 0x00007f1d3f80b75e
    frame #57: 0x00007f1d3f817b25
    frame #58: 0x00007f1d3f80bfff
    frame #59: 0x00007f1d3f84ac27
    frame #60: 0x00007f1d3f80d7b5
    frame #61: 0x00007f1d3f80b7fc
    frame #62: 0x00007f1d3f78eacf
    frame #63: 0x00007f1d3f80c7e0
    frame #64: 0x00007f1d3f80b99b
    frame #65: 0x00007f1d3f81def8
    frame #66: 0x00007f1d3f80b1d4
    frame #67: 0x00007f1d3f806f55
    frame #68: 0x00007f1d3f80b7fc
    frame #69: 0x00007f1d3f78eacf
    frame #70: 0x00007f1d3f80c7e0
    frame #71: 0x00007f1d3f80b99b
    frame #72: 0x00007f1d3f958238
    frame #73: 0x00007f1d3f958419
    frame #74: 0x00007f1d3f90ab6d
    frame #75: 0x00007f1d3f90afbb
    frame #76: 0x00007f1d3f80b7fc
    frame #77: 0x00007f1d3f78eacf
    frame #78: 0x00007f1d3f80c7e0
    frame #79: 0x00007f1d3f80b99b
    frame #80: 0x00007f1d3f970e3f
    frame #81: 0x000000000190dd43 node`Builtins_AsyncFunctionAwaitResolveClosure + 67
    frame #82: 0x00000000019c5bf1 node`Builtins_PromiseFulfillReactionJob + 49
    frame #83: 0x00000000018fda34 node`Builtins_RunMicrotasks + 628
    frame #84: 0x00000000018d4003 node`Builtins_JSRunMicrotasksEntry + 131
    frame #85: 0x00000000010505bd node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 1421
    frame #86: 0x00000000010517ef node`v8::internal::Execution::TryRunMicrotasks(v8::internal::Isolate*, v8::internal::MicrotaskQueue*) + 143
    frame #87: 0x0000000001085166 node`v8::internal::MicrotaskQueue::RunMicrotasks(v8::internal::Isolate*) + 150
    frame #88: 0x00000000018d5d1c node`Builtins_InterpreterEntryTrampoline + 220
    frame #89: 0x00000000010854fd node`v8::internal::MicrotaskQueue::PerformCheckpoint(v8::Isolate*) + 61
    frame #90: 0x0000000000f57eaf node`v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) + 303
    frame #91: 0x0000000000f5871d node`v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, unsigned long*, int) + 141
    frame #92: 0x0000000000f58be5 node`v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) + 277
    frame #93: 0x0000000001963df6 node`Builtins_CEntry_Return1_ArgvOnStack_BuiltinExit + 54
    frame #94: 0x00007f1d3f6af766
    frame #95: 0x00000000018d40dc node`Builtins_JSEntryTrampoline + 92
    frame #96: 0x00000000018d3e03 node`Builtins_JSEntry + 131
    frame #97: 0x000000000105015b node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 299
    frame #98: 0x00000000010511f4 node`v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) + 100
    frame #99: 0x0000000000f138cd node`v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) + 333
    frame #100: 0x0000000000bc8daf node`node::InternalCallbackScope::Close() + 655
    frame #101: 0x0000000000bc912b node`node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) + 619
    frame #102: 0x0000000000be029f node`node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) + 127
    frame #103: 0x0000000000dd7417 node`node::StreamBase::CallJSOnreadMethod(long, v8::Local<v8::ArrayBuffer>, unsigned long, node::StreamBase::StreamBaseJSChecks) + 167
    frame #104: 0x0000000000dd77e6 node`node::EmitToJSStreamListener::OnStreamRead(long, uv_buf_t const&) + 502
    frame #105: 0x0000000000e9e738 node`node::crypto::TLSWrap::ClearOut() + 280
    frame #106: 0x0000000000ea00d0 node`node::crypto::TLSWrap::OnStreamRead(long, uv_buf_t const&) + 160
    frame #107: 0x0000000000ddf14f node`node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) + 143
    frame #108: 0x0000000000ddf56a node`node::LibuvStreamWrap::ReadStart()::'lambda0'(uv_stream_s*, long, uv_buf_t const*)::_FUN(uv_stream_s*, long, uv_buf_t const*) + 90
    frame #109: 0x00000000018bdba5 node`uv__read(stream=0x00007f1d00000000) at stream.c:1143:7
    frame #110: 0x00000000018bded0 node`uv__stream_io(loop=<unavailable>, w=0x00007f1d00000000, events=112775744) at stream.c:1203:5
    frame #111: 0x00007ffefede27fa linux-vdso.so.1
    frame #112: 0x00000000018c593b node`uv__io_poll(loop=<unavailable>, timeout=<unavailable>) at linux.c:1485:11
    frame #113: 0x00000000018b1be7 node`uv_run(loop=0x00000000055eab40, mode=UV_RUN_DEFAULT) at core.c:447:5
    frame #114: 0x0000000000bc9be6 node`node::SpinEventLoopInternal(node::Environment*) + 342
    frame #115: 0x0000000000d0ce94 node`node::NodeMainInstance::Run(node::ExitCode*, node::Environment*) (.part.0) + 148
    frame #116: 0x0000000000d0d92d node`node::NodeMainInstance::Run() + 205
    frame #117: 0x0000000000c71c0f node`node::Start(int, char**) + 1423
    frame #118: 0x00007f1d6706224a libc.so.6`__libc_start_call_main(main=(node`main), argc=2, argv=0x00007ffefedb3238) at libc_start_call_main.h:58:16
    frame #119: 0x00007f1d67062305 libc.so.6`__libc_start_main_impl(main=(node`main), argc=2, argv=0x00007ffefedb3080, init=0x000000000000000e, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffefedb3228) at libc-start.c:360:3
    frame #120: 0x0000000000bc630e node`_start + 46
Hyodori04 commented 1 week ago

Hi @gaikwadrahul8,

It seems that I've found a solution.

After examining the core dump, I suspected the crash was related to the oneDNN code paths, so I explicitly enabled them by setting TF_ENABLE_ONEDNN_OPTS=1.
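For anyone else hitting this: the flag has to reach the process environment before @tensorflow/tfjs-node loads its native binding, since the TensorFlow runtime reads it during initialization. A minimal sketch (setting the variable with ENV in the Dockerfile works equally well; the require in the comment is only illustrative):

```typescript
// Set the flag before the native binding is loaded. Static ESM imports
// are hoisted and would run before this line, so load tfjs-node with
// require() (or a dynamic import) afterwards, e.g.:
//   const tf = require('@tensorflow/tfjs-node');
process.env.TF_ENABLE_ONEDNN_OPTS = '1';

console.log(`TF_ENABLE_ONEDNN_OPTS=${process.env.TF_ENABLE_ONEDNN_OPTS}`);
```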

As a result, I saw a log message I hadn't encountered before:

"oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0."

Since then, the application has not crashed.

Although I don't fully understand the exact reason, I would appreciate your thoughts on this.

gaikwadrahul8 commented 1 week ago

Hi, @Hyodori04

I apologize for the delayed response, and I'm glad to hear that your application no longer crashes after enabling TF_ENABLE_ONEDNN_OPTS=1. When that flag is set, TensorFlow uses custom operations provided by the oneDNN library for better performance on Intel CPUs. These operations take advantage of Intel CPU features such as SIMD (Single Instruction, Multiple Data) instructions and other hardware-specific optimizations.

The log message warns that enabling oneDNN optimizations can lead to slightly different numerical results than TensorFlow's default CPU implementation or other libraries. This is due to variations in computation order and the floating-point round-off errors that result from how oneDNN optimizes and parallelizes computations.

Disabling oneDNN optimizations (TF_ENABLE_ONEDNN_OPTS=0) can significantly impact performance, especially on Intel CPUs, where oneDNN is designed to leverage hardware-specific optimizations such as SIMD instructions. If your application depends heavily on TensorFlow for computationally intensive work, the lack of optimization can slow execution, which might manifest as crashes under load or when handling large datasets.

For memory-leak diagnosis:

  1. Monitor memory usage: use docker stats (optionally docker stats <container_id>) to watch the memory usage of your Node.js server container over time. Look for memory consumption that increases sharply, or climbs steadily without decreasing after requests are processed. You can also use tf.profile.

  2. Check for resource exhaustion: determine whether the crashes coincide with high CPU or memory usage, which would indicate that the server is running out of resources.

  3. Review your Docker configuration: ensure that the container is configured with appropriate memory limits (the --memory and --memory-swap flags) so it cannot consume excessive resources.
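As a sketch of the tfjs side of step 1: tf.memory().numTensors reports how many tensors are currently alive, so a small wrapper (the helper below is illustrative, not a tfjs API) can show whether each request leaks tensors. The counter is injected as a function so the helper is testable in isolation; in a real tfjs-node app you would pass () => tf.memory().numTensors.

```typescript
// Wrap an async operation and report how the live-tensor count changed.
// A count that keeps growing across requests points at undisposed tensors.
async function withTensorAccounting<T>(
  label: string,
  numTensors: () => number, // e.g. () => tf.memory().numTensors
  fn: () => Promise<T>,
): Promise<T> {
  const before = numTensors();
  const result = await fn();
  const after = numTensors();
  console.log(`${label}: numTensors ${before} -> ${after} (leaked ${after - before})`);
  return result;
}
```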

Also double-check your code for any tensors created outside the tf.tidy block, or during intermediate computations around the model.predict call, and make sure they are disposed with tf.dispose once they are no longer needed.

Thank you for your cooperation and patience.

Hyodori04 commented 1 week ago

I have already gone through the memory-leak diagnosis steps and checked tf.tidy. There is no memory increase before the crash, and there is no tf code that isn't covered by tf.tidy or tf.dispose.

I think it's a bug of some kind that TensorFlow crashes when oneDNN is not used, since oneDNN is meant to be an optimization. I'd like to know what part of the code causes the crash when oneDNN is disabled, but that isn't easy for me to determine. Maybe later you or I can pin down the faulty code.