tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Flag TF_GPU_ALLOCATOR=cuda_malloc_async (to work with large tensors) results in: "Error in py_call_impl(callable, dots$args, dots$keywords) : InternalError: No allocator statistics" #48869

Closed rpsantosa closed 3 years ago

rpsantosa commented 3 years ago

System information

Describe the current behavior
To work with large inputs, around 3 GB, TensorFlow returns an error asking to set this flag. With TF_GPU_ALLOCATOR=cuda_malloc_async set, it can handle the large object fine, but TensorFlow is no longer able to run training or load saved models, not even from keras.applications.

Describe the expected behavior
Models load without any error.

Standalone code to reproduce the issue
The following was run in RStudio, but I believe the same would happen in Python:

library(keras)
a <- application_densenet121(input_shape = c(256, 256, 3), include_top = FALSE)

2021-05-02 07:59:46.375811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 computeCapability: 8.6 coreClock: 1.815GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-05-02 07:59:46.376160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-05-02 07:59:46.376323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-02 07:59:46.376479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-05-02 07:59:46.376581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
Error in py_call_impl(callable, dots$args, dots$keywords) : InternalError: No allocator statistics

Other info / logs

This error, "Error in py_call_impl(callable, dots$args, dots$keywords) : InternalError: No allocator statistics", also happens in many other circumstances, such as loading a saved model or running any model at all.

This happened on TF 2.5.0-rc1, 2.5.0-rc2, and 2.4.1. If the flag is disabled, the error doesn't happen, but TensorFlow can't handle larger tensors, like images over 2 GB.
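
Assuming the behavior really is the same from Python (as suggested above, but not confirmed here), a minimal equivalent sketch would be the following; the environment variable has to be set before TensorFlow initializes its GPU devices, i.e. before the import:

import os

# Select the CUDA async allocator; this must happen before TensorFlow is
# imported, since the allocator is chosen when the GPU devices initialize.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf

# Python counterpart of the R call above.
model = tf.keras.applications.DenseNet121(input_shape=(256, 256, 3),
                                          include_top=False)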

sanjoy commented 3 years ago

@trentlo @chsigg Any idea what's going on?

trentlo commented 3 years ago

> @trentlo @chsigg Any idea what's going on?

This is a known issue that we mentioned to you in the last meeting.

@nouiz is back. He knows more. (@bas-aarts FYI.)

nouiz commented 3 years ago

This diff fixes at least part of the problem:

diff --git a/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h b/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h
index 8c7613992ea..b04c592046e 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.h
@@ -67,7 +67,7 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   explicit GpuCudaMallocAsyncAllocator(PlatformDeviceId platform_device_id,
                                        size_t pool_size,
                                        bool reserve_memory = false,
-                                       bool compute_stats = false);
+                                       bool compute_stats = true);
   ~GpuCudaMallocAsyncAllocator() override;
   string Name() override { return name_; }
   void* AllocateRaw(size_t alignment, size_t num_bytes) override;
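
To illustrate what this default controls, here is a hypothetical Python sketch (the real logic lives in GpuCudaMallocAsyncAllocator in C++, not in Python): with compute_stats = false the allocator never records statistics, so any caller that asks for them fails with exactly the InternalError reported above.

# Hypothetical illustration only, mirroring the C++ default above.
class SketchAsyncAllocator:
    def __init__(self, pool_size, reserve_memory=False, compute_stats=False):
        # With compute_stats=False, no statistics are ever recorded.
        self._stats = {"bytes_in_use": 0} if compute_stats else None

    def get_stats(self):
        if self._stats is None:
            # Surfaces to users as "InternalError: No allocator statistics".
            raise RuntimeError("No allocator statistics")
        return self._stats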

But I have problems building upstream TF in our container, so I have difficulty investigating this. Here is the error, in case you have an idea of what is going on:

Execution platform: @local_execution_config_platform//:platform
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: /root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow7functor12UnaryFunctorIN5Eigen9GpuDeviceENS0_3negINS2_4halfEEEEclERKS3_NS2_9TensorMapINS2_6TensorIS5_Li1ELi1ElEELi16ENS2_11MakePointerEEENSA_INSB_IKS5_Li1ELi1ElEELi16ESD_EE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 26, in <module>
    from tensorflow.python.tools.api.generator import doc_srcs
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 40, in <module>
    from tensorflow.python.eager import context
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/eager/context.py", line 35, in <module>
    from tensorflow.python import pywrap_tfe
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tfe.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 83, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: /root/.cache/bazel/_bazel_root/45a7ca8b5c99c684c2d5c22cdd8175f0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow7functor12UnaryFunctorIN5Eigen9GpuDeviceENS0_3negINS2_4halfEEEEclERKS3_NS2_9TensorMapINS2_6TensorIS5_Li1ELi1ElEELi16ENS2_11MakePointerEEENSA_INSB_IKS5_Li1ELi1ElEELi16ESD_EE

nouiz commented 3 years ago

I fixed the above link error, but now compiling TF takes way longer, so I'm still blocked. See this new bug: https://github.com/tensorflow/tensorflow/issues/48966

rpsantosa commented 3 years ago

There is more (maybe it helps): I trained a model on image data using TensorFlow 2.5.0-rc3. When making predictions, it resulted in an out-of-memory error. Then I downgraded to 2.4.1, loading the same pre-trained model and data (same batch_size, etc.). The predictions ran just fine, using at least 1.5 GB less of my GPU memory (it has 8 GB).
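
If it helps to quantify that comparison, here is a sketch for measuring GPU memory around prediction. The saved-model path and the images variable are placeholders; tf.config.experimental.get_memory_info exists from TF 2.5, while 2.4 only has the older get_memory_usage. Note that both rely on allocator statistics, so under cuda_malloc_async with the current default they would presumably hit the same "No allocator statistics" error.

import tensorflow as tf

model = tf.keras.models.load_model("my_saved_model")  # placeholder path
preds = model.predict(images, batch_size=32)          # 'images' assumed loaded

# TF 2.5+: returns {'current': ..., 'peak': ...} in bytes for the device.
print(tf.config.experimental.get_memory_info("GPU:0"))
# TF 2.4: use tf.config.experimental.get_memory_usage("GPU:0") instead.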
