visinf / irr

Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation (CVPR 2019)
Apache License 2.0

why so many issues are closed without being solved? #45

Closed poincarelee closed 2 years ago

poincarelee commented 2 years ago

I tried to run your irr code but ran into many problems, most of which are related to the PyTorch/CUDA version and the correlation package. I have searched Google and other search engines and tried almost every existing method (including CUDA 8/9/10 and torch 0.4.1/1.0.0/1.5.0), but unfortunately it still does not run normally. I saw that many version problems appear in closed issues without being solved. I was wondering whether you have checked your code and why you closed so many questions. If there are bugs in your code, you need to check your code and solve the problems instead of covering them up.

hurjunhwa commented 2 years ago

Oh, relax and don’t get so angry 😅. Sorry if it didn’t solve your problem! I close issues when people are satisfied with the answer or when they don’t reply to my answer for a long time, presuming that it resolved their problems. Which issues do you think I covered up? Let’s open them again and continue discussing them.

poincarelee commented 2 years ago

Sorry for getting a little angry after several nights of digging through Google. OK, I did run into a lot of problems. I will list some of them as follows:
1. About correlation_package: I didn't find a proper way to solve it. In the end I copied this package from flownet2 and compiled it to get around the problem.
2. Something is wrong with "_model.feature_pyramid_extractor", such as 'unexpected key "_model.feature_pyramid_extractor.convs.0.0.0.weight" in state_dict'. A prefix "module." may need to be added.
3. The multi_gpu training code doesn't work. Two errors occurred:
   1) "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 2 (while checking arguments for cudnn_convolution)" (https://discuss.pytorch.org/t/runtimeerror-expected-tensor-for-argument-1-input-to-have-the-same-device-as-tensor-for-argument-2-weight-but-device-0-does-not-equal-2-while-checking-arguments-for-cudnn-convolution/58634). This may be something wrong in the network.
   2) While debugging 1), a new error occurred as follows:

image

Problem 2 needs to be checked. As for the multi_gpu training problem, if it's convenient for you, maybe you could fix it; if not, users can do it themselves. More importantly, if you have tested with a stable combination of pytorch/cuda/python versions, I think it would be better to state it. From my experience over these days, pytorch 1.0/cuda 10.0/python 3.6 would be a good choice.

hurjunhwa commented 2 years ago

Thanks for sharing the details!

Correlation layer: This has been a tricky part because the CUDA implementation depends on both the CUDA and PyTorch versions. The best way so far is to get rid of the dependency altogether by using a pure-python implementation (https://github.com/visinf/irr/issues/43). Please pull the latest commit and try to run again.
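To illustrate what a CUDA-free correlation layer can look like, here is a minimal sketch written with standard PyTorch ops only. It is not the code from the repository or from the linked issue. The function name `correlation`, the `max_disp` parameter, the normalization by channel count, and the ordering of the output channels are all assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def correlation(feat1, feat2, max_disp=4):
    """Cost volume between two feature maps of shape (B, C, H, W).

    Returns a tensor of shape (B, (2 * max_disp + 1) ** 2, H, W). Each output
    channel is the channel-wise mean of feat1 times a shifted copy of feat2.
    The normalization and channel order are assumptions of this sketch.
    """
    b, c, h, w = feat1.shape
    # Zero-pad feat2 on all four sides so every shifted window stays in bounds.
    padded = F.pad(feat2, (max_disp, max_disp, max_disp, max_disp))
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            window = padded[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat1 * window).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)


f1 = torch.randn(2, 8, 16, 24)
f2 = torch.randn(2, 8, 16, 24)
cost = correlation(f1, f2, max_disp=4)
print(cost.shape)  # torch.Size([2, 81, 16, 24])
```

Because this uses only indexing, padding, and elementwise ops, it needs no compiled extension and therefore has no separate CUDA toolkit requirement, at the cost of being slower than a dedicated kernel.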

Model prefix: I think I know where the problem comes from. I will check the source code again.
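As a stopgap for this kind of key mismatch, the checkpoint keys can be remapped before loading. The sketch below is an illustration only: the helper names `add_prefix` and `strip_prefix` are hypothetical, and where the "module." prefix actually belongs depends on which module was wrapped, which the thread does not state. Plain strings stand in for tensors so the example runs without PyTorch.

```python
from collections import OrderedDict


def add_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict whose keys all start with prefix."""
    return OrderedDict(
        (k if k.startswith(prefix) else prefix + k, v)
        for k, v in state_dict.items()
    )


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )


# Stand-in checkpoint: real values would be tensors (assumption for illustration).
checkpoint = OrderedDict([
    ("_model.feature_pyramid_extractor.convs.0.0.0.weight", "w"),
    ("_model.feature_pyramid_extractor.convs.0.0.0.bias", "b"),
])

wrapped = add_prefix(checkpoint)    # keys now start with "module."
restored = strip_prefix(wrapped)    # back to the original keys
print(list(wrapped))
```

The remapped dictionary would then be passed to the model's `load_state_dict`. Comparing the checkpoint keys against `model.state_dict().keys()` first shows which direction of remapping is needed.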

Multi-GPU: Did you use DataParallel or DDP for the multi-GPU training? Please provide more details on what you tried (which model, scripts, dataloader, etc.). Without knowing them, it's difficult to guess only by looking at the error log.

Version: As described in the readme file, I checked that the basic functionality works on PyTorch 1.5. If you switch the correlation layer to the python implementation, I think the CUDA-version dependency will be gone, as long as PyTorch is properly installed.

poincarelee commented 2 years ago

Ok, thanks for your quick reply. I will try the pure-python implementation of the correlation package.

As for multi-gpu, I used the code you wrote in main.py (with the comment markers removed) as follows:

image

I used IRR-PWC for training on the Sintel dataset with device_ids=[0,1,2,3]. During debugging, example_dict is on GPU 1 and the model is on GPU 0. It's weird.

Yes, most of the version problems are about correlation_package. A pure-python implementation would avoid these problems; I will try it.

poincarelee commented 2 years ago

I've solved multi-gpu training. This issue can be closed.

hurjunhwa commented 2 years ago

Is it because you assign the model and loss to GPU index 0 with model_and_loss._model.cuda(device_ids[0])?
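For readers hitting the same device-mismatch error, the usual DataParallel pattern is sketched below. This is not the fix that was applied in this thread, which was never described. `TinyNet` and `to_parallel` are hypothetical names for illustration, and the code falls back to the CPU when no GPU is present. nn.DataParallel expects the wrapped module's parameters to live on device_ids[0], and the forward pass should go through the wrapper so that inputs are scattered across the replicas.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the real model (assumption for illustration)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


def to_parallel(model, device_ids=None):
    """Wrap model in nn.DataParallel when GPUs exist; otherwise keep it on CPU."""
    if not torch.cuda.is_available():
        return model, torch.device("cpu")
    if device_ids is None:
        device_ids = list(range(torch.cuda.device_count()))
    # Parameters must sit on the first listed device before wrapping.
    primary = torch.device("cuda", device_ids[0])
    model = model.to(primary)
    return nn.DataParallel(model, device_ids=device_ids), primary


model, device = to_parallel(TinyNet())
batch = torch.randn(8, 3, 32, 32).to(device)
out = model(batch)  # call the wrapper itself, not model.module
print(out.shape)    # torch.Size([8, 4, 32, 32])
```

Calling the inner module directly, or wrapping only a submodule while the rest of the network stays on one GPU, is a common way to end up with inputs and weights on different devices.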