tcmalloc: large alloc on Colab and Tensorflow killed on local machine due to over consumption of RAM

arunumd commented 5 years ago

System information

What is the top-level directory of the model you are using: /home
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
TensorFlow installed from (source or binary): Binary
TensorFlow version (use command below): 1.9.0
Bazel version (if compiling from source): N/A
CUDA/cuDNN version: 10.1.243
GPU model and memory: NVIDIA Quadro RTX 5000; and 16 GB RAM
Exact command to reproduce: I ran the following code in an IPython notebook on both my local machine (local GPU) and Google Colab:
```
!git clone https://github.com/charlesq34/pointnet.git
cd pointnet/sem_seg/
!sh download_data.sh
!python train.py --log_dir log6 --test_area 6
```
Describe the problem

TensorFlow always tries to consume as much RAM as it can, even though I have a GPU, and the kernel gets killed while I am training my deep learning model. I consulted multiple online sources (1, 2, 3, 4, 5, 6) and tried the following:

Reduce the batch size
Change the optimizer from adam to momentum

However, none of these suggestions helped to solve the problem.
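
One way to confirm that host RAM (rather than GPU memory) is what runs out is to sample `/proc/meminfo` while training. The following is a minimal sketch, assuming a Linux machine such as the Ubuntu 18.04 setup above; the helper names `parse_meminfo`, `memory_summary`, and `read_memory_summary` are hypothetical and are not part of TensorFlow or the pointnet repository.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Return (ram_used_fraction, swap_used_fraction), each between 0.0 and 1.0."""
    total = info["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    ram_used = (total - available) / total
    swap_used = (swap_total - swap_free) / swap_total if swap_total else 0.0
    return ram_used, swap_used


def read_memory_summary(path="/proc/meminfo"):
    """Read the live values from the kernel (Linux only)."""
    with open(path) as f:
        return memory_summary(parse_meminfo(f.read()))
```

Calling `read_memory_summary()` periodically from a separate terminal or notebook cell while `train.py` runs should show whether RAM and swap usage climb steadily before the kernel is killed.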

Source code / logs

The error log is very long, so I am attaching it as a separate text file: ERROR_LOG.txt

rolba commented 4 years ago

Hello. Make sure that you have reduced your batch size enough. I had the same issue with my code: https://github.com/rolba/ai-nimals/blob/master/ai_nimals_train_alexnet.py Reducing the batch size to 32 for the generators did the job. I also kept an eye on my RAM during training using htop in the console. When swap started to fill up, that was a sign for me that I had a problem with my batch size.

You can find HDF5 generators on my GitHub account. Please check them out, use them, and let me know if you are still having problems.
Br. Pawel
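
The batching idea in the comment above can be sketched as follows. This is only an illustration, not the generator code from the linked repository: `iter_batches` is a hypothetical helper, and it assumes the dataset supports `len()` and slicing (as Python lists, NumPy arrays, and h5py datasets do), so only one batch is held in memory at a time.

```python
def iter_batches(dataset, batch_size=32):
    """Yield consecutive slices of `dataset`, each with at most `batch_size` items.

    `dataset` only needs len() and slicing. With an h5py dataset, each slice
    is read from disk on demand instead of loading the whole file into RAM.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]
```

For example, iterating `iter_batches(samples, 32)` over 70 samples yields batches of 32, 32, and 6 items; the final short batch is kept rather than dropped.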

PrakashSuthar commented 4 years ago

Hello, I get the tcmalloc error very often when running code on Colab from Python files (say, train.py), but the same code (the contents of train.py copied into a cell) gives no such error when run from the cell. I would like to know the cause of this behaviour.

ravikyram commented 4 years ago

@arunumd

Is this still an issue? Please close this thread if your issue was resolved. Thanks!

arunumd commented 4 years ago

@ravikyram Yes. This is still the same issue

ravikyram commented 4 years ago

@arunumd

Please let us know which pretrained model you are using and share the related code. Thanks!

entorius commented 3 years ago

For example, this issue still persists when I try to run the https://github.com/dorarad/gansformer model. I'm using TensorFlow 1.15.0 on Google Colab with a GPU.

tensorflow / models

tcmalloc: large alloc on Colab and Tensorflow killed on local machine due to over consumption of RAM #7652
