Closed: rpsantosa closed this issue 3 years ago.
@trentlo @chsigg Any idea what's going on?
This is a known issue that we mentioned to you in the last meeting.
@nouiz is back. He knows more. (@bas-aarts FYI.)
This diff fixes at least part of the problem:
diff --git a/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h b/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h
index 8c7613992ea..b04c592046e 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h
@@ -67,7 +67,7 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
explicit GpuCudaMallocAsyncAllocator(PlatformDeviceId platform_device_id,
size_t pool_size,
bool reserve_memory = false,
- bool compute_stats = false);
+ bool compute_stats = true);
~GpuCudaMallocAsyncAllocator() override;
string Name() override { return name_; }
void* AllocateRaw(size_t alignment, size_t num_bytes) override;
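For context, the allocator at issue is selected via an environment variable. The sketch below is my own illustration (not code from the report) of how the flag is set and why statistics queries fail while compute_stats defaults to false:

```python
import os

# TF_GPU_ALLOCATOR must be set before TensorFlow initializes its GPU
# devices; otherwise the default BFC allocator remains in use.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# With the old default (compute_stats = false), GpuCudaMallocAsyncAllocator
# keeps no AllocatorStats, so any API that reads them -- e.g.
# tf.config.experimental.get_memory_info("GPU:0") -- fails with
# "InternalError: No allocator statistics".
print(os.environ["TF_GPU_ALLOCATOR"])
```

Flipping the default to true, as in the diff, should make those statistics queries succeed again.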
But I'm having problems building upstream TF in our container, so it's hard for me to investigate this. Here is the error, in case you have an idea what is going on:
Execution platform: @local_execution_config_platform//:platform
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: /root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow7functor12UnaryFunctorIN5Eigen9GpuDeviceENS0_3negINS2_4halfEEEEclERKS3_NS2_9TensorMapINS2_6TensorIS5_Li1ELi1ElEELi16ENS2_11MakePointerEEENSA_INSB_IKS5_Li1ELi1ElEELi16ESD_EE
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 26, in <module>
    from tensorflow.python.tools.api.generator import doc_srcs
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 40, in <module>
    from tensorflow.python.eager import context
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/eager/context.py", line 35, in <module>
    from tensorflow.python import pywrap_tfe
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tfe.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 83, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: /root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow7functor12UnaryFunctorIN5Eigen9GpuDeviceENS0_3negINS2_4halfEEEEclERKS3_NS2_9TensorMapINS2_6TensorIS5_Li1ELi1ElEELi16ENS2_11MakePointerEEENSA_INSB_IKS5_Li1ELi1ElEELi16ESD_EE
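As a side note (my own decoding aid, not part of the report), the mangled name in the undefined-symbol error can be demangled with c++filt from GNU binutils to see which kernel instantiation failed to link:

```python
import subprocess

# The undefined symbol reported by the dynamic linker, copied verbatim.
symbol = "_ZN10tensorflow7functor12UnaryFunctorIN5Eigen9GpuDeviceENS0_3negINS2_4halfEEEEclERKS3_NS2_9TensorMapINS2_6TensorIS5_Li1ELi1ElEELi16ENS2_11MakePointerEEENSA_INSB_IKS5_Li1ELi1ElEELi16ESD_EE"

# c++filt (from GNU binutils) turns the Itanium-mangled name back into a
# readable C++ signature.
demangled = subprocess.run(
    ["c++filt"], input=symbol, capture_output=True, text=True
).stdout.strip()
print(demangled)
```

The name demangles to roughly tensorflow::functor::UnaryFunctor<Eigen::GpuDevice, tensorflow::functor::neg<Eigen::half>>::operator()(...), i.e. the half-precision negation GPU kernel, which suggests the CUDA kernel objects were not linked into _pywrap_tensorflow_internal.so.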
I fixed the above link error, but now compiling TF takes much longer, so I'm still blocked. See this new issue: https://github.com/tensorflow/tensorflow/issues/48966
There is more (maybe it helps): I trained a model on image data using TensorFlow 2.5.0-rc3. When making predictions, it resulted in an out-of-memory error. Then I downgraded to 2.4.1, loading the same pre-trained model and data (same batch_size, etc.). The predictions were just fine, using at least 1.5 GB less of my GPU memory (it has 8 GB).
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): my code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 / RStudio
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary):
TensorFlow version (use command below): "v2.5.0-rc0-36-g0d1805aede0"
Python version: 3.7.3
CUDA/cuDNN version: 11.2.1
GPU model and memory: 3070/8G
Describe the current behavior: When working with large objects, like 3 GB, TensorFlow returns an error asking to set that flag. With the flag set (TF_GPU_ALLOCATOR=cuda_malloc_async), it can handle the object fine, but TensorFlow is no longer able to run training or load saved models, even from keras.applications.
Describe the expected behavior: Models load without any error.
Standalone code to reproduce the issue (in RStudio, but I believe it would be the same in Python):
a <- application_densenet121(input_shape = c(256, 256, 3), include_top = FALSE)
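For readers reproducing from Python, the R call above corresponds roughly to the following Keras call. This translation is mine, not code from the report (weights=None is my substitution to skip the pretrained-weights download); on an affected setup with the flag set, it triggers the same InternalError:

```python
import tensorflow as tf

# Python counterpart of application_densenet121(input_shape = c(256, 256, 3),
# include_top = F); weights=None skips the ImageNet weight download but still
# exercises the model-building path where the allocator error surfaces.
model = tf.keras.applications.DenseNet121(
    input_shape=(256, 256, 3), include_top=False, weights=None
)
print(model.output_shape)
```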
2021-05-02 07:59:46.375811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 computeCapability: 8.6 coreClock: 1.815GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-05-02 07:59:46.376160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-05-02 07:59:46.376323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-02 07:59:46.376479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-05-02 07:59:46.376581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
Error in py_call_impl(callable, dots$args, dots$keywords) : InternalError: No allocator statistics
Other info / logs
This error, "Error in py_call_impl(callable, dots$args, dots$keywords) : InternalError: No allocator statistics", happens in many other circumstances: loading a saved model, or even running any model.
It happened on tf 2.5.0-rc1 / 2.5.0-rc2 / 2.4.1. If the flag is disabled, the error doesn't happen, but TensorFlow can't handle larger tensors, like images over 2 GB.