Hyodori04 opened 1 week ago
Hi @gaikwadrahul8,
It seems that I've found a solution.
After examining the core dump, I suspected it was related to the oneDNN code path, so I explicitly enabled the option by setting TF_ENABLE_ONEDNN_OPTS=1.
As a result, I saw a log message I hadn't encountered before:
"oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0."
Since then, the application has not crashed.
Although I haven't fully understood the exact reason yet, I would appreciate your thoughts on this.
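For reference, a minimal sketch of how the flag can be set from Node itself, assuming @tensorflow/tfjs-node and that the flag is read from the environment when the native binding initializes (in a Dockerfile you could instead set ENV TF_ENABLE_ONEDNN_OPTS=1):

```ts
// Sketch, not the original poster's code: set the flag before the native
// TensorFlow binding loads, since it is read from the environment at init.
process.env.TF_ENABLE_ONEDNN_OPTS = '1';

// Use require (not a static import): ES module imports are hoisted and
// would load the binding before the assignment above runs.
const tf = require('@tensorflow/tfjs-node');

console.log(tf.version); // confirm the backend loaded
```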
Hi, @Hyodori04
I apologize for the delayed response, and it's good to hear that your application no longer crashes after enabling the TF_ENABLE_ONEDNN_OPTS=1 flag.

When you set TF_ENABLE_ONEDNN_OPTS=1, TensorFlow uses custom operations provided by the oneDNN library for better performance on Intel CPUs. These operations are optimized to take advantage of Intel CPU features such as SIMD (Single Instruction, Multiple Data) instructions and other hardware-specific optimizations.
The message warns that enabling oneDNN optimizations can lead to slightly different numerical results compared to TensorFlow's default CPU implementation or other libraries. This is due to variations in computation order and floating-point round-off errors that may occur as a result of how oneDNN optimizes and parallelizes computations.
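To make the round-off point concrete, floating-point addition is not associative, so a parallel or reordered reduction can legitimately produce a slightly different sum:

```ts
// Summation order changes the last bits of the result:
const a = 0.1, b = 0.2, c = 0.3;
console.log((a + b) + c);              // 0.6000000000000001
console.log(a + (b + c));              // 0.6
console.log((a + b) + c === a + (b + c)); // false
```

This is the only kind of difference the warning refers to; the results are still correct to within floating-point tolerance.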
Disabling oneDNN optimizations (TF_ENABLE_ONEDNN_OPTS=0) can significantly impact performance, especially on Intel CPUs, where oneDNN is designed to leverage hardware-specific optimizations like SIMD instructions. If your application depends heavily on TensorFlow for computationally intensive tasks, the lack of optimization could lead to slower execution, which might manifest as crashes under load or when handling large datasets.
For memory leak diagnosis:

- Monitor memory usage: use docker stats or docker stats <container_id> to watch the memory usage of your Node.js server container over time. Look for trends where memory consumption increases sharply or climbs steadily without decreasing after requests are processed. You can also use tf.profile from inside the process (see the sketch after this list).
- Check for resource exhaustion: determine whether the crashes coincide with high CPU or memory usage, which would indicate that your server is running out of resources.
- Review your Docker configuration: ensure the container is configured with appropriate memory limits (the --memory and --memory-swap flags) so it cannot consume excessive resources.
- Double-check your code for any tensors created outside the tf.tidy block, or during intermediate computations within the model.predict call, and make sure they are disposed of with tf.dispose once they are no longer needed.
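As referenced in the list above, here is a sketch of per-request tensor bookkeeping, assuming @tensorflow/tfjs-node and a single-output layers model (handlePredict and the shapes are hypothetical, not from this report):

```ts
import * as tf from '@tensorflow/tfjs-node';

// Hypothetical handler: wrap the predict call in tf.tidy so every
// intermediate tensor created inside the callback is disposed; only the
// returned tensor survives.
async function handlePredict(model: tf.LayersModel, input: number[][]) {
  const output = tf.tidy(() => {
    const x = tf.tensor2d(input);
    return model.predict(x) as tf.Tensor; // assumes a single-output model
  });
  const data = await output.data(); // copy results out before disposing
  output.dispose();

  // Log live tensor/byte counts per request: a steady climb here points at
  // a leak on the JS side rather than in the native binding.
  const { numTensors, numBytes } = tf.memory();
  console.log(`tensors=${numTensors} bytes=${numBytes}`);

  // For a one-off check, tf.profile reports allocations made by a closure:
  //   const info = await tf.profile(() => model.predict(tf.tensor2d(input)));
  //   console.log(info.newTensors, info.peakBytes);

  return Array.from(data);
}
```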
Thank you for your cooperation and patience.
I have already gone through the memory leak diagnosis and checked tf.tidy. There is no memory increase before the crash, and no tf code that isn't handled by tf.tidy or tf.dispose.

I think it's something of a bug that TF crashes when oneDNN is not used, since oneDNN is meant for optimization. I'd like to know what part of the code causes the crash when oneDNN is disabled, but that's not easy for me to determine. Maybe later you or I can pin down the faulty code.
System information
Describe the current behavior
We serve our service in a Docker Node container. When there are several sequential requests that use model.predict, our Node server gets killed. I think there is some kind of memory leak, because of what the error logs show (see the lldb trace below), and the Docker metrics report a similar memory size.
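For context, a hypothetical minimal version of this serving pattern (not the original poster's code; the model path, input shape, and port are placeholders), assuming @tensorflow/tfjs-node:

```ts
import * as http from 'http';
import * as tf from '@tensorflow/tfjs-node';

async function main() {
  // Placeholder model path, not from the report.
  const model = await tf.loadLayersModel('file://./model/model.json');

  // Each request runs model.predict, mirroring the sequential requests
  // described above.
  const server = http.createServer(async (_req, res) => {
    const input = tf.zeros([1, 224, 224, 3]); // placeholder input shape
    const out = model.predict(input) as tf.Tensor;
    await out.data();
    tf.dispose([input, out]); // free both tensors after each request
    res.end('ok');
  });

  server.listen(3000);
}

main();
```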
Describe the expected behavior
The memory leak error does not happen.
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/CodePen/any notebook.
Other info / logs
lldb trace