running cascade_hrnet.py on the citypersons dataset

msha096 commented 3 years ago

Hi, I get the error below when I try to train cascade_hrnet on the CityPersons dataset. How can I find out which tensor of shape [1024, 256, 7, 7] is needed for the gradient computation and where it gets modified?

  File "tools/train.py", line 98, in <module>
    main()
  File "tools/train.py", line 94, in main
    logger=logger)
  File "/home/mingzhi/Downloads/Pedestron/mmdet/apis/train.py", line 63, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/mingzhi/Downloads/Pedestron/mmdet/apis/train.py", line 228, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/mingzhi/anaconda3/envs/pedest/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/mingzhi/anaconda3/envs/pedest/lib/python3.7/site-packages/mmcv/runner/runner.py", line 271, in train
    self.call_hook('after_train_iter')
  File "/home/mingzhi/anaconda3/envs/pedest/lib/python3.7/site-packages/mmcv/runner/runner.py", line 229, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/mingzhi/Downloads/Pedestron/mmdet/core/my_mmcv/runner/hooks/mean_teacher_optimizer.py", line 18, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/mingzhi/anaconda3/envs/pedest/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/mingzhi/anaconda3/envs/pedest/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1024, 256, 7, 7]], which is output 0 of IndexPutBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
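
For reference, the hint at the end of the error can be followed with a sketch like the one below. It is a toy graph, not Pedestron code: the function names and tensor shapes are made up for illustration, and which operation in Pedestron actually triggers the error is not known from this traceback. It assumes that index assignment on a tensor is recorded by autograd as an index_put operation, which matches the `IndexPutBackward` in the message. With anomaly detection enabled, PyTorch additionally prints the forward-pass traceback of the operation whose backward failed, which is usually enough to locate the in-place edit.

```python
import torch


def backward_with_inplace_edit() -> str:
    """Trigger the in-place error on a toy graph and return its message."""
    # Enable anomaly detection, as the hint in the error message suggests.
    torch.autograd.set_detect_anomaly(True)
    try:
        x = torch.ones(4, requires_grad=True)
        y = x * 2
        y[torch.tensor([0, 1])] = 0.0  # index assignment: an in-place index_put
        w = y * y                      # autograd saves y to compute this gradient
        y[torch.tensor([2])] = 5.0     # second in-place edit changes the saved y
        w.sum().backward()             # the version check on y fails here
        return "no error"
    except RuntimeError as err:
        return str(err)
    finally:
        torch.autograd.set_detect_anomaly(False)


def backward_with_clone() -> list:
    """Same toy graph, but the second edit is applied to a copy of y."""
    x = torch.ones(4, requires_grad=True)
    y = x * 2
    y[torch.tensor([0, 1])] = 0.0
    w = y * y
    y2 = y.clone()                     # edit a copy so the saved y stays untouched
    y2[torch.tensor([2])] = 5.0
    w.sum().backward()
    return x.grad.tolist()


if __name__ == "__main__":
    print(backward_with_inplace_edit())
    print(backward_with_clone())
```

In the training script, the same switch would be a single `torch.autograd.set_detect_anomaly(True)` call before the training loop starts. It slows training down, so it is best left on only while debugging.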

hasanirtiza commented 3 years ago

Hi, I cannot reproduce this issue locally. From a quick search, it appears that in some cases this error is related to the PyTorch version. We have trained HRNet with Cascade locally several times without any problems. Maybe have a look here and see if it helps.

msha096 commented 3 years ago

What version of PyTorch are you using? It seems v1.2 has this problem...

hasanirtiza commented 3 years ago

1.1.0

hasanirtiza / Pedestron

running cascade_hrnet.py on the citypersons dataset #67