sanghyun-son / EDSR-PyTorch

PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)
MIT License

PSNR: nan (Best: nan @epoch 1) #79

Closed mwk0423 closed 5 years ago

mwk0423 commented 5 years ago

Hello, I'm new to SR. My operating system is Windows 10 with CUDA v8.0. I noticed that the code is apparently meant to run on Linux, so I installed Cygwin to be able to execute Linux commands on my computer. When I ran the code following the README, I got an error about "Device or resource busy", so I applied the fix @kice posted in #50 and modified my dataloader.py. It runs now, but another problem has emerged. I have posted some of the results below.
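(Side note for anyone hitting the same "Device or resource busy" error: a common Windows workaround, assuming the repository's --n_threads option controls the data loader's worker count as in option.py, is to disable worker processes entirely:

python main.py --n_threads 0 [your other options]

With zero workers the data loads in the main process, which sidesteps the Windows multiprocessing issue at the cost of slower loading.)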

[Epoch 102] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.00s

Saving... Total: 0.13s

[Epoch 103] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.00s

Saving... Total: 0.14s

[Epoch 104] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.00s

Saving... Total: 0.14s

[Epoch 105] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.00s

Saving... Total: 0.13s

[Epoch 106] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.00s

Saving... Total: 0.17s

[Epoch 107] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.02s

Saving... Total: 0.14s

[Epoch 108] Learning rate: 1.00e-4

Evaluation: 0it [00:00, ?it/s] [DIV2K x2] PSNR: nan (Best: nan @epoch 1) Forward: 0.02s

Saving... Total: 0.16s

The PSNR is always nan, and the trained model is not saved (only some parameter files are saved).

I would be very grateful for any suggestions about my problem.

YongboLiang commented 5 years ago

I also encountered this problem. After I changed dir_data to the folder location of my dataset, the problem was solved:

parser.add_argument('--dir_data', type=str, default='../../dataset', help='dataset directory')
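For context, here is a sketch of the directory layout the loader expects, assuming the default DIV2K naming described in the repository's README (adjust the scale folder to match your run):

dataset/
    DIV2K/
        DIV2K_train_HR/                # 0001.png ... 0800.png
        DIV2K_train_LR_bicubic/
            X2/                        # 0001x2.png ... 0800x2.png

Note that --dir_data should point at the dataset/ folder itself, not at dataset/DIV2K.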

mwk0423 commented 5 years ago

You are right. I changed my folder location to '../../dataset' (before it was '../.../dataset/DIV2K') and it trains normally... except for some extra memory errors -_-! Such as:

Preparing loss function:
1.000 * L1
[Epoch 1]   Learning rate: 1.00e-4
[80/800]    [L1: 27.2510]   15.3+0.3s
[160/800]   [L1: 20.0985]   14.6+0.2s
[240/800]   [L1: 16.4675]   14.7+0.2s
[320/800]   [L1: 14.4444]   14.7+0.2s
[400/800]   [L1: 12.9677]   14.7+0.2s
[480/800]   [L1: 11.9202]   14.8+0.1s
[560/800]   [L1: 11.2859]   14.8+0.2s
[640/800]   [L1: 10.9254]   14.8+0.2s
[720/800]   [L1: 10.4220]   14.9+0.2s
[800/800]   [L1: 9.9505]    14.9+0.2s

Evaluation: 10%|████▍ | 1/10 [00:04<00:41, 4.65s/it]
THCudaCheck FAIL file=c:\users\administrator\downloads\new-builder\win-wheel\pytorch\aten\src\thc\generic/THCStorage.cu line=58 error=2 : out of memory

Traceback (most recent call last):
  File "main.py", line 26, in <module>
    t.test()
  File "C:\Users\Administrator\Desktop\SR\EDSR-PyTorch-master\src\trainer.py", line 93, in test
    sr = self.model(lr, idx_scale)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\Desktop\SR\EDSR-PyTorch-master\src\model\__init__.py", line 53, in forward
    return self.model(x)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\Desktop\SR\EDSR-PyTorch-master\src\model\edsr.py", line 58, in forward
    x = self.tail(res)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\container.py", line 91, in forward
    input = module(input)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\container.py", line 91, in forward
    input = module(input)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\modules\pixelshuffle.py", line 40, in forward
    return F.pixel_shuffle(input, self.upscale_factor)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\torch\nn\functional.py", line 1662, in pixel_shuffle
    shuffle_out = input_view.permute(0, 1, 4, 2, 5, 3).contiguous()
RuntimeError: cuda runtime error (2) : out of memory at c:\users\administrator\downloads\new-builder\win-wheel\pytorch\aten\src\thc\generic/THCStorage.cu:58

I suspect some parameters are set incorrectly. Would you mind telling me something about this?

YongboLiang commented 5 years ago

Maybe your GPU memory isn't enough. You can try changing n_resblocks to 8 (or even smaller), or switching the test dataset to Set14 (the images in Set14 are smaller and need less GPU memory).
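For example, a reduced-memory run might look like this (a sketch assuming the baseline options from the repository's option.py; --chop enables the memory-efficient forward that processes the image in patches at test time):

python main.py --model EDSR --scale 2 --n_resblocks 8 --n_feats 64 --patch_size 96 --chop --data_test Set14

Halving n_resblocks and n_feats shrinks the network's activations, while --chop mostly addresses the test-time OOM seen in the traceback above.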

mwk0423 commented 5 years ago

I modified some parameters and it worked. Thanks a lot! @YongboLiang

qiufengyuzhi commented 11 hours ago

Excuse me, I referred to the conversation above to modify my code, but I still haven't solved the problem of the PSNR value being nan. I think the model still isn't reading the dataset. I would be very grateful for any suggestions about my problem.

(F:\yolov7) C:\Users\Lenovo\Desktop\Non-Local-Sparse-Attention-main\src>python main.py --dir_data "../../datasets" --n_GPUs 1 --rgb_range 1 --chunk_size 144 --n_hashes 4 --save_models --lr 1e-4 --decay 200-400-600-800 --epochs 1000 --chop --save_results --n_resblocks 32 --n_feats 256 --res_scale 0.1 --batch_size 16 --model NLSN --scale 4 --patch_size 96 --save NLSN_x4 --data_train DIV2K
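A quick sanity check for this situation, as a minimal sketch assuming the same dataset layout convention as above (the paths below are illustrative, not part of the repository): run it from src/ to confirm that the directories the loader will look in actually exist and contain images.

import os

dir_data = "../../datasets"  # the value passed via --dir_data
hr_dir = os.path.join(dir_data, "DIV2K", "DIV2K_train_HR")
lr_dir = os.path.join(dir_data, "DIV2K", "DIV2K_train_LR_bicubic", "X4")

# An empty or missing directory means there is nothing to train or
# evaluate on, and the reported PSNR stays nan.
for d in (hr_dir, lr_dir):
    print(d, "->", len(os.listdir(d)) if os.path.isdir(d) else "MISSING")

If either line prints MISSING, --dir_data is pointing at the wrong place (note the plural "datasets" here versus the "dataset" folder that resolved the original issue above).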