NaN loss after a few steps

esdrascosta commented 3 years ago

Hi, thanks for posting this code. I'm trying to replicate the results, but the loss becomes NaN after 11 steps. I installed all the dependencies at the specified versions, but I still get this result. The log is below:

$ python train.py --obj zipper --data_path ./data/mvtech_anomaly --batch_size 2  

{'alpha': 1.0, 'batch_size': 2, 'belta': 1.0, 'data_path': ' ./data/mvtech_anomaly', 'data_type': 'mvtec', 'epochs': 300, 'gamma': 1.0, 'grayscale': False, 'img_size': 256, 'input_channel': 3, 'k_value': [2, 4, 8, 16], 'lr': 0.0001, 'obj': 'zipper', 'prefix': '2020-12-03-1197', 'save_dir': './mvtec/zipper/seed_2988/', 'seed': 2988, 'validation_ratio': 0.2, 'weight_decay': 1e-05}
   1/300 ----- [[2020-12-03 23:30:45]] [Need: 00:00:00]
  0%|                                                                                                         | 0/96 [00:00<?, ?it/s]Step Loss: 1.779465
  1%|█                                                                                                | 1/96 [00:02<03:22,  2.13s/it]Step Loss: 1.835103
  2%|██                                                                                               | 2/96 [00:03<02:52,  1.83s/it]Step Loss: 1.479402
  3%|███                                                                                              | 3/96 [00:04<02:36,  1.69s/it]Step Loss: 1.401773
  4%|████                                                                                             | 4/96 [00:05<02:26,  1.59s/it]Step Loss: 1.448756
  5%|█████                                                                                            | 5/96 [00:07<02:13,  1.46s/it]Step Loss: 1.693701
  6%|██████                                                                                           | 6/96 [00:08<02:02,  1.36s/it]Step Loss: 1.229446
  7%|███████                                                                                          | 7/96 [00:09<02:00,  1.36s/it]Step Loss: 1.215524
  8%|████████                                                                                         | 8/96 [00:10<02:00,  1.36s/it]Step Loss: 1.493567
  9%|█████████                                                                                        | 9/96 [00:12<01:52,  1.29s/it]Step Loss: 1.430892
 10%|██████████                                                                                      | 10/96 [00:13<01:46,  1.24s/it]Step Loss: 1.118710
 11%|███████████                                                                                     | 11/96 [00:14<01:48,  1.28s/it]Step Loss: nan
 12%|████████████                                                                                    | 12/96 [00:15<01:43,  1.23s/it]Step Loss: nan
 14%|█████████████                                                                                   | 13/96 [00:16<01:41,  1.22s/it]

plutoyuxie commented 3 years ago

@esdrascosta The same thing happens from time to time, and we are also trying to find the cause. For now, maybe you can just have a cup of coffee and try running it again.

MDAooo commented 3 years ago

Hi @plutoyuxie, thanks for sharing your code. I have also been working on an implementation of RIAD. Your code is great and helped me a lot; I really appreciate it.

Hi @esdrascosta, I ran into this problem before. I fixed it by changing line 27 in gms_loss.py to 'x = torch.sqrt(x + sys.float_info.epsilon)', and I have never seen the NaN loss since. I think the problem is caused by taking the square root of 0: the derivative of sqrt(x) is 1/(2*sqrt(x)), which is infinite at x = 0, so the backward pass produces inf/NaN gradients. You can try this modification; I hope it helps.
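To see why the epsilon helps, here is a dependency-free sketch that replays by hand what autograd does in the backward pass of a gradient magnitude sqrt(gx^2 + gy^2). It assumes (not checked against the repository) that the x on line 27 of gms_loss.py is the sum of squared horizontal and vertical image gradients; the function names are hypothetical. On a flat patch gx = gy = 0, the derivative of sqrt is infinite, and multiplying it by du/dgx = 2*gx = 0 gives inf * 0 = NaN, which then spreads through the weights.

```python
import math
import sys


def sqrt_backward(u):
    """Local derivative of sqrt(u), following IEEE float rules as autograd does.

    d/du sqrt(u) = 1 / (2 * sqrt(u)), which is +inf at u == 0. Plain Python
    would raise ZeroDivisionError, so the inf is produced explicitly here.
    """
    s = math.sqrt(u)
    return math.inf if s == 0.0 else 1.0 / (2.0 * s)


def magnitude_grad_wrt_gx(gx, gy, eps=0.0):
    """Backward of m = sqrt(gx**2 + gy**2 + eps) w.r.t. gx via the chain rule.

    dm/dgx = sqrt'(u) * du/dgx, with u = gx**2 + gy**2 + eps and du/dgx = 2 * gx.
    """
    u = gx * gx + gy * gy + eps
    return sqrt_backward(u) * (2.0 * gx)


# A flat image patch has zero gradient in both directions: gx == gy == 0.
without_eps = magnitude_grad_wrt_gx(0.0, 0.0)
with_eps = magnitude_grad_wrt_gx(0.0, 0.0, eps=sys.float_info.epsilon)

print(without_eps)  # inf * 0.0 -> nan
print(with_eps)     # large but finite factor * 0.0 -> 0.0
```

For non-flat regions the epsilon (about 2.2e-16) changes the magnitude negligibly, so the fix only matters exactly where the square root would otherwise be taken at 0.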

BTW, have you ever tried training ONE reconstruction model for multiple objects? I am trying this, but the reconstruction results are not as good as with a single object.

plutoyuxie commented 3 years ago

Thanks, @MaDongao, I will try it soon. Reconstructing multiple objects is much harder. As far as I know, the state-of-the-art method for this is PaDiM, which is not a reconstruction-based method.

taikiinoue45 commented 3 years ago

@plutoyuxie The following library might be helpful for your implementation: https://github.com/photosynthesis-team/piq

Repository: plutoyuxie/Reconstruction-by-inpainting-for-visual-anomaly-detection (issue #1)