Training stops without an error

andrzejbro commented 3 years ago

Hi,

I am able to train a model using the not_cuda flag, but it takes a long time.

However, when I run main_train.py normally, it shows a model summary, prints "Epoch 0 Dsteps: 3" and stops with no error message.

I tried to locate how far it gets, and it stops somewhere near the line errD_real.backward(retain_graph=True) in training.py, but I don't know what may be wrong.

I would be very thankful for the help.
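One way to narrow down a silent stop like the one near errD_real.backward is to ask Python to write a traceback if the process dies in native code. The sketch below uses the standard-library faulthandler module. The helper name enable_crash_traces and the log file name crash_trace.log are made up for illustration and are not part of SinGAN, and this assumes the silent exit is a native crash, which the thread does not confirm.

```python
import faulthandler

# Kept at module level so the file stays open for the life of the process;
# faulthandler writes to its file descriptor at crash time.
_trace_file = None


def enable_crash_traces(path="crash_trace.log"):
    """Dump Python tracebacks to `path` on fatal signals (hypothetical helper)."""
    global _trace_file
    _trace_file = open(path, "w")
    # all_threads=True also dumps the stacks of threads other than the crashing one
    faulthandler.enable(file=_trace_file, all_threads=True)
    return faulthandler.is_enabled()
```

Calling enable_crash_traces() at the top of main_train.py before training starts would be one place to try it. Running the script as python -X faulthandler main_train.py enables the same handler without code changes, writing to stderr instead of a file.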

tamarott commented 3 years ago

Do you have a CUDA machine (any GPU)? Otherwise the code isn't supposed to run without the not_cuda flag.

andrzejbro commented 3 years ago

I do have an NVIDIA GPU which is visible to PyTorch. I also have CUDA 10.1 installed and running.

bigbroliuLa commented 2 years ago

Same here, I also have this problem. When I use CUDA (that is, when I do not specify --not_cuda), the training process stops very quickly and outputs only one image under the scale 0 folder. But if I specify --not_cuda, training runs fine. My computer runs Windows 11, and I have a GTX 2080 graphics card with CUDA installed, so I am wondering what happened. I paste my output here:

PS C:\liwen> python main_train.py --input_name balloons.png
Random Seed: 5184
GeneratorConcatSkip2CleanAdd(
  (head): ConvBlock(
    (conv): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
    (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (body): Sequential(
    (block1): ConvBlock(
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block2): ConvBlock(
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block3): ConvBlock(
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
  )
  (tail): Sequential(
    (0): Conv2d(32, 3, kernel_size=(3, 3), stride=(1, 1))
    (1): Tanh()
  )
)
WDiscriminator(
  (head): ConvBlock(
    (conv): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
    (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (body): Sequential(
    (block1): ConvBlock(
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block2): ConvBlock(
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block3): ConvBlock(
      (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
  )
  (tail): Conv2d(32, 1, kernel_size=(3, 3), stride=(1, 1))
)
PS C:\liwen>

So, you can see that the code stops running very quickly without any training. I checked my CUDA settings many times but could not find any problem. Is this caused by my operating system, Windows 11, or is something else wrong?

bigbroliuLa commented 2 years ago

And there is no error shown during the run.

bigbroliuLa commented 2 years ago

I finally fixed this by installing torch 1.4.0 and torchvision 0.5.0 built for the correct CUDA version (CUDA 10.1). Note: torch and torchvision must be installed as builds compatible with CUDA 10.1, since they depend on it. Also, removing unused packages related to CUDA or torch may help.

Check the PyTorch website or forum to find the matching installation package.
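To see which versions are actually installed before and after reinstalling, a small report script can help. This is only a sketch: the function name cuda_report and the dictionary keys are invented for illustration, and it reports what the installed torch build says about itself rather than proving that the build matches the local CUDA installation.

```python
import importlib


def cuda_report():
    """Collect version facts relevant to a torch/CUDA mismatch (hypothetical helper)."""
    report = {
        "torch": None,
        "torchvision": None,
        "built_for_cuda": None,
        "cuda_available": False,
        "gpu_backward_ok": None,
    }
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        return report  # torch is missing or its native libraries failed to load
    report["torch"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    try:
        report["torchvision"] = importlib.import_module("torchvision").__version__
    except Exception:
        pass  # torchvision is missing or could not be loaded
    if report["cuda_available"]:
        try:
            # Tiny forward/backward pass on the GPU, far smaller than real training
            x = torch.randn(4, 4, device="cuda", requires_grad=True)
            (x * x).sum().backward()
            torch.cuda.synchronize()
            report["gpu_backward_ok"] = True
        except RuntimeError:
            report["gpu_backward_ok"] = False
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

With the fix described above, one would expect torch to start with 1.4.0 and built_for_cuda to read 10.1. If the script itself exits without printing anything, that points to the same kind of silent failure as the training run.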

I spent two days trying to solve this problem, which was truly frustrating. The OS I use is Windows 11, which I find to be a poor choice for deep learning, so I believe you will want to use something like Linux.

tamarott / SinGAN

Training stops without an error #138