ebranda opened this issue 5 years ago
I'm getting the same error.
I am also getting the same error.
I solved this problem by reducing the batch size, number of iterations, etc.
I am also getting the same error.
I solved this problem by reducing the batch size, number of iterations, etc.
Did you manage to get this code running?
I'm getting the same error. Can you please say exactly what changes you made?
@xiaowangzi6668 I'm still getting the same error. Can you please tell me exactly what changes need to be made? Thanks.
I reduced the batch size and the number of iterations and am still getting the error. Can you please tell me exactly what changes you made? Thanks.
@manvirvirk A resource exhausted error literally means you have used up all the available RAM in your local environment. Try training in a better environment. It will work if you set the batch size to an extremely small value, such as 2.
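For reference, in TF 1.x (the reporter is on 1.14) you can also tell the session to allocate GPU memory on demand rather than grabbing it all up front. This is a minimal sketch of that config, not code from this repo, and it only helps with pre-allocation and fragmentation; a genuinely too-large batch still needs to be reduced.

```python
import tensorflow as tf

# Sketch (assumption, not code from BigGAN-Tensorflow): let the bfc allocator
# grow GPU memory on demand instead of reserving it all at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training ops here, with a reduced batch_size ...
```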
Nope, it's not like that. I am using Google Colab Pro with 25 GB of RAM (the RAM is not even fully occupied), and I still get this error.
Actually, it depends on your GPU architecture. Internally, a GPU contains different types of cores (e.g., TF32 and FP64 units). When these resources are not enough for the work (threads) that CUDA assigns, you get an OOM (Out of Memory) error.
@ebranda Solution -> buy a new GPU (one or more) with a larger number of CUDA cores, or reduce the batch size step by step until the error goes away [batch sizes like 128, 64, 32, 16, 8, 4, 2]; see the sketch below the NOTE.
NOTE: Reducing the batch size may significantly affect your model's quality, since BigGAN is reported to give better results with large batch sizes (that's why the default batch size is 2048).
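A minimal sketch of the "walk the batch size down" advice above (my own helper, not part of BigGAN-Tensorflow; `run_one_step` is a hypothetical callable you would write to build the graph and run one training step at a given batch size):

```python
import tensorflow as tf

# Try candidate batch sizes from large to small and keep the first one whose
# training step does not run out of GPU memory.
def find_working_batch_size(run_one_step, candidates=(128, 64, 32, 16, 8, 4, 2)):
    for bs in candidates:
        try:
            run_one_step(bs)              # caller builds the graph and runs one step
            return bs
        except tf.errors.ResourceExhaustedError:
            tf.reset_default_graph()      # discard the failed graph before retrying
    raise RuntimeError("even batch size 2 ran out of memory")
```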
Thanks for contributing this. About a minute into each training run I am receiving the following error, after which the program exits: (1) Resource exhausted: OOM when allocating tensor with shape[256,192,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
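For context, the shape in that message already implies a very large allocation; a rough back-of-the-envelope calculation (my own arithmetic, not repo code):

```python
# A single [256, 192, 64, 64] float32 activation is roughly 0.75 GiB, and
# training keeps many such activations plus gradients alive at the same time.
elems = 256 * 192 * 64 * 64      # batch * channels * height * width
size_gib = elems * 4 / 2**30     # float32 = 4 bytes per element
print(size_gib)                  # ~0.75 GiB for this one tensor
```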
Also, when initializing, the program reports the following: [] Reading checkpoints... [] Failed to find a checkpoint [!] Load failed... but it continues to run.
I have reduced batch_size to 256 and img_size to 128 and the error persists. Running TensorFlow version 1.14.0.
Any ideas?
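As a first diagnostic (editor's note: this is a standard TF 1.x utility, not code from this repo), it helps to print how much GPU memory TensorFlow actually sees, since the GPU_0_bfc allocator error above is about GPU memory rather than system RAM:

```python
from tensorflow.python.client import device_lib

# List local devices and report each GPU's usable memory in GiB; compare this
# against the size of the tensors the model tries to allocate.
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.name, d.memory_limit / 2**30, 'GiB')
```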