ahmadxon opened this issue 4 years ago
Your GPU doesn't have enough memory for this, so either train on a smaller image or move to a bigger GPU.
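If you retrain on a smaller image, something like the sketch below works (the path and the 180 px cap are placeholders, not from this thread; SinGAN's config.py also exposes a --max_size option that caps the finest-scale resolution directly, so check your checkout for that first):

from PIL import Image

# Minimal sketch: shrink the training image so the finest SinGAN scale
# fits in 2 GiB of GPU memory. Path and 180 px cap are placeholders.
img = Image.open("Input/Images/my_image.png")
img.thumbnail((180, 180), Image.LANCZOS)  # downscale in place, keeps aspect ratio
img.save("Input/Images/my_image_small.png")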
On 16 Dec 2019, at 04:43, Ahmadxon notifications@github.com wrote:
First of all, thank you for your wonderful work. I am running animation.py and after scale 7 I am getting this error. How can I solve it? Thanks!
Hello! I encountered the same problem when running main_train.py, but it only appeared after I added an attention layer to both the generator and the discriminator networks; the original code runs without any problems. Can adding a single attention layer really exhaust GPU memory? Thank you, and wish you a happy life!
@markstrefford ... ran into a similar issue; I have 6 GiB of memory and am training on a 1024x1024-pixel image...
Can adding a single attention layer really exhaust GPU memory?
Attention layers consume a lot of memory. You can try pooling or another mechanism to reduce the size of the attention matrix, and with it the memory usage.
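To illustrate the pooling idea (a hypothetical sketch, not the commenter's actual code): max-pooling the key and value maps shrinks the attention matrix from (HW) x (HW) to (HW) x (HW/4), which is where most of the memory goes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledSelfAttention(nn.Module):
    # Self-attention whose keys/values are 2x2 max-pooled, so the
    # attention matrix is (H*W) x (H*W/4) instead of (H*W) x (H*W).
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # B x HW x C/r
        k = F.max_pool2d(self.key(x), 2).flatten(2)    # B x C/r x HW/4
        v = F.max_pool2d(self.value(x), 2).flatten(2)  # B x C x HW/4
        attn = torch.softmax(q @ k, dim=-1)            # B x HW x HW/4
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return self.gamma * out + x                    # residual connection

With reduction=8 and the 2x2 pooling, the dominant activation is roughly a quarter of the size, at the cost of coarser attention.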
@victorca25 Thank you for your idea, which has benefited me a lot. Wish you a happy life!
I want to know why memory usage keeps increasing when training the model at finer scales. Since the parameters of the previous models are fixed, I don't understand where the extra memory goes.
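Even though the earlier generators are frozen, each finer scale processes a larger image, so the activations (and the gradient-penalty graph) grow with resolution; the frozen parameters themselves add almost nothing. A hypothetical probe to watch this per scale (on PyTorch older than 1.4, memory_reserved() is named memory_cached()):

import torch

# Hypothetical probe: call after each scale finishes training to see
# how much device memory that scale's activations left behind.
def log_gpu_memory(scale):
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"scale {scale}: {alloc:.1f} MiB allocated, {reserved:.1f} MiB reserved")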
@ahmadxon : How did you solve the "out of memory" error?
I just used the Google Colab platform and ran it there.
First of all, thank you for your wonderful work. I am running animation.py and after scale 7 I am getting this error. How can I solve it? Thanks!
scale 7:[1975/2000]
scale 7:[1999/2000]
GeneratorConcatSkip2CleanAdd(
  (head): ConvBlock(
    (conv): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1))
    (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (body): Sequential(
    (block1): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block2): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block3): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
  )
  (tail): Sequential(
    (0): Conv2d(128, 3, kernel_size=(3, 3), stride=(1, 1))
    (1): Tanh()
  )
)
WDiscriminator(
  (head): ConvBlock(
    (conv): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1))
    (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (body): Sequential(
    (block1): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block2): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block3): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
  )
  (tail): Conv2d(128, 1, kernel_size=(3, 3), stride=(1, 1))
)
Traceback (most recent call last):
  File "main_train.py", line 29, in <module>
    train(opt, Gs, Zs, reals, NoiseAmp)
  File "C:\Users\Wooks\Source\ml_khan_20185057\SinGAN\SinGAN\training.py", line 39, in train
    z_curr,in_s,G_curr = train_single_scale(D_curr,G_curr,reals,Gs,Zs,in_s,NoiseAmp,opt)
  File "C:\Users\Wooks\Source\ml_khan_20185057\SinGAN\SinGAN\training.py", line 162, in train_single_scale
    gradient_penalty.backward()
  File "C:\Users\Wooks\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Wooks\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 2.00 GiB total capacity; 1.14 GiB already allocated; 9.49 MiB free; 177.34 MiB cached)
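For anyone who lands here with a small GPU: besides training on a smaller image, a minimal sketch of one possible mitigation, assuming you are willing to edit SinGAN's training.py (D_curr is the current scale's discriminator in its per-scale loop):

import torch

# Sketch: at the end of each scale in training.py, drop the last
# reference to the finished discriminator so it can be garbage-collected,
# then return cached, unused blocks to the CUDA driver before the next,
# larger scale is built.
del D_curr
torch.cuda.empty_cache()

This does not reduce the peak memory of the finest scale itself, so on a 2 GiB card a smaller training image is still the more reliable fix.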