jfzhang95 / pytorch-video-recognition

PyTorch implemented C3D, R3D, R2Plus1D models for video activity recognition.
MIT License
1.16k stars, 250 forks

Why is the training loss always NaN? #17

Open lucasjinreal opened 5 years ago

lucasjinreal commented 5 years ago

I get loss values like this:


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:10<00:00,  2.24it/s]
[train] Epoch: 22/100 Loss: nan Acc: 0.010870849580527
Execution time: 250.25667172999238

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 22/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.448329468010343

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.23it/s]
[train] Epoch: 23/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.90277546200377

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.09it/s]
[val] Epoch: 23/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.87914375399123

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.24it/s]
[train] Epoch: 24/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.9237438449927

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 24/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.460865497996565

It's all NaN. What could be the reason?

lizhongguo commented 5 years ago

This happens to me too. The PyTorch version is 0.4.1.

100%|█████████████████████████████████████████████████████████████████████████████████| 423/423 [09:39<00:00, 1.34s/it]
[train] Epoch: 100/100 Loss: nan Acc: 0.010874704491725768
Execution time: 579.1260393778794

100%|█████████████████████████████████████████████████████████████████████████████████| 108/108 [01:02<00:00, 2.30it/s]
[val] Epoch: 100/100 Loss: nan Acc: 0.0111162575266327
Execution time: 62.677289011888206

Save model at /media/ext/lizhongguo/ActionRecognition/pytorch-video-recognition/run/run_1/models/C3D-ucf101_epoch-99.pth.tar

100%|█████████████████████████████████████████████████████████████████████████████████| 136/136 [01:16<00:00, 3.15it/s]
[test] Epoch: 100/100 Loss: nan Acc: 0.010736764161421697
Execution time: 76.43733210070059

jfzhang95 commented 5 years ago

Hi, you could try reducing the learning rate.

KyuminHwang commented 5 years ago

I also ran into Loss: nan. I reduced the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).

If the loss is NaN, useful weights cannot be stored, so the model's accuracy can't improve. Has anybody solved this problem?
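To find the epoch where the loss first goes bad instead of training on for 100 epochs, a small guard can help. This is only a sketch: `check_loss` is a hypothetical helper, not part of this repository, and it assumes the loss has already been converted to a plain Python float (for example with `loss.item()`).

```python
import math


def check_loss(loss_value, epoch):
    """Return the loss unchanged, or raise as soon as it is NaN or infinite.

    Hypothetical helper for illustration; not part of the repository.
    """
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value!r} at epoch {epoch}; "
            "stop here and inspect the learning rate and the inputs"
        )
    return loss_value


# Usage inside a training loop (names are placeholders):
#     running_loss += check_loss(loss.item(), epoch)
```

Stopping at the first non-finite value also avoids saving checkpoints whose weights are already NaN.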

lizhongguo commented 5 years ago

I checked the code at https://github.com/facebookresearch/VMZ/blob/master/lib/models/c3d_model.py and added a BatchNorm layer between each Conv layer and ReLU layer. It now seems to work on the UCF-101 dataset.
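A minimal sketch of that Conv, BatchNorm, ReLU ordering in PyTorch is shown here. The class name `ConvBNReLU3d`, the channel counts, and the clip size are assumptions made for illustration; this is not the exact change that was applied to the C3D model in this repository.

```python
import torch
import torch.nn as nn


class ConvBNReLU3d(nn.Module):
    """3D convolution followed by batch normalization and ReLU (illustrative only)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # The conv bias is dropped because BatchNorm3d has its own learnable shift.
        self.conv = nn.Conv3d(in_channels, out_channels,
                              kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Normalizing before the nonlinearity keeps activations in a bounded range.
        return self.relu(self.bn(self.conv(x)))


# Dummy clip batch: (batch, channels, frames, height, width), sizes chosen arbitrarily.
block = ConvBNReLU3d(3, 8)
clip = torch.randn(2, 3, 4, 16, 16)
out = block(clip)
```

With `kernel_size=3` and `padding=1` the block keeps the temporal and spatial size of the input, so it can stand in for a plain Conv3d plus ReLU pair without changing the shapes of later layers.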

lucasjinreal commented 5 years ago

@lizhongguo let me have a look

wave-transmitter commented 5 years ago

> I also ran into Loss: nan. I reduced the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).
>
> If the loss is NaN, useful weights cannot be stored, so the model's accuracy can't improve. Has anybody solved this problem?

Reducing the learning rate means choosing a value lower than 1e-3, such as 1e-5 or 0.5e-3 (1e-1 is larger than 1e-3, so that change was an increase). Personally, I trained the model from scratch on UCF101 with a learning rate of 1e-3 without any NaN issues.
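Why a learning rate that is too large ends in NaN can be seen on a toy problem. The sketch below runs plain gradient descent on a one-parameter quadratic loss. It is not the repository's training loop, `final_loss` is a hypothetical name, and the threshold at which a real network diverges will be different from the one in this toy.

```python
import math


def final_loss(lr, steps=2000, w0=1.0):
    """Run plain gradient descent on loss(w) = w * w and return the last loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w * w
        w = w - lr * grad   # for lr > 1 every step makes |w| larger
    return w * w


for lr in (1.5, 1e-3):
    loss = final_loss(lr)
    status = "nan" if math.isnan(loss) else "finite"
    print(f"lr={lr}: final loss is {status}")
```

With the large step the parameter doubles in magnitude on every update, overflows to infinity, and the next update computes infinity minus infinity, which is NaN. Once that happens every later loss is NaN too, which matches the logs above where the value never recovers.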

KyuminHwang commented 5 years ago

@wave-transmitter Thank you for the comment! I solved this problem by changing the learning rate: I reduced it to 1e-5, and then it worked correctly.

ilovekj commented 5 years ago

However, when I reduce the learning rate, the accuracy is only 0.20. What should I do?

KyuminHwang commented 5 years ago

@ilovekj I recommend searching for a learning rate that suits your setup. I tried several values before I found one that worked. How about augmenting your dataset?

ilovekj commented 5 years ago

@makeastir But there is another issue: it seems the code splits the dataset randomly, which is not allowed. There are three official splits, and when I use this code, it performs poorly.
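One likely reason random splitting matters on UCF101: clips are cut from longer source videos (the `gNN` part of the file name), and as far as I know the official lists keep each group on one side of the split. The sketch below checks for groups that appear in both train and test. The sample lines imitate the `trainlistNN.txt` / `testlistNN.txt` format and are illustrative only; `parse_split` and `leaked_groups` are hypothetical helpers, not part of this repository.

```python
import re

# Illustrative lines in the style of the official UCF101 split lists.
SAMPLE_TRAINLIST = """\
ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1
ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi 1
Archery/v_Archery_g09_c01.avi 3
"""
SAMPLE_TESTLIST = """\
ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi
Archery/v_Archery_g02_c01.avi
"""


def parse_split(text):
    """Return (class_name, group_id, path) tuples from a split list."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        path = line.split()[0]  # train lists carry a label after the path
        class_name = path.split("/")[0]
        match = re.search(r"_g(\d+)_c\d+", path)
        group_id = int(match.group(1)) if match else None
        entries.append((class_name, group_id, path))
    return entries


def leaked_groups(train_entries, test_entries):
    """Return (class_name, group_id) pairs present in both splits."""
    train_keys = {(c, g) for c, g, _ in train_entries}
    test_keys = {(c, g) for c, g, _ in test_entries}
    return train_keys & test_keys


train = parse_split(SAMPLE_TRAINLIST)
test = parse_split(SAMPLE_TESTLIST)
print(leaked_groups(train, test))
```

If a random split puts `g08_c01` in train and `g08_c02` in test, `leaked_groups` reports that group, and the test accuracy then partly measures memorization of near-duplicate clips rather than generalization.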

KyuminHwang commented 5 years ago

@ilovekj I also used this code and got good performance. The code has an augmentation module, which should make the dataset more useful. How about increasing the size of your dataset? In my case, Non-True is 400 and True is 150. Or how about reducing the number of features in the dataset?

ilovekj commented 5 years ago

@makeastir But you didn't use the official splits.

ziqi-zhang commented 5 years ago

@ilovekj Hi. I used the official split with the corresponding dataloader and got only 1% accuracy, but the same code on the random split reaches 98%. Did you figure out the problem?

ilovekj commented 5 years ago

Maybe it's because we didn't use a pretrained model, but I am not sure.