tarun005 / FLAVR

Code for FLAVR: A fast and efficient frame interpolation technique.

Apache License 2.0

455 stars, 69 forks

Training issue #45

Closed. issakh closed this issue 6 months ago

issakh commented 2 years ago

Hi, I've been trying to train this network on an A100 GPU. However, since torch 1.5.0 doesn't support this GPU, I am forced to use torch 1.9.0. Training is broken for torch versions > 1.5.0, and I cannot find the reason why. I have looked at the differences between the torch versions, but nothing makes it clear why this happens. Basically, the model stays stuck at around 20 dB for the duration of training. I previously tested this code on a 1080Ti with torch 1.5.0, and that worked fine, but given memory constraints and training time, the A100 would be the better option. Do you have any idea why this occurs, and are there any possible solutions?

Thanks
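For context on the numbers in this thread, here is a minimal sketch of how PSNR relates to pixel error, assuming the 20 dB figure is a PSNR and that pixel values lie in [0, 1]. It is not taken from the FLAVR code, which may compute the metric differently; `psnr` and `rmse_for_psnr` are illustrative names.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)


def rmse_for_psnr(psnr_db, max_val=1.0):
    """Root-mean-square pixel error that corresponds to a given PSNR."""
    return max_val / (10 ** (psnr_db / 20))


# A uniform error of 0.1 on every pixel gives a PSNR of about 20 dB.
print(psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.6, 0.6, 0.6]))
# Conversely, 20 dB corresponds to an RMSE of about 0.1 on a [0, 1] scale.
print(rmse_for_psnr(20))
```

Under these assumptions, a PSNR that plateaus near 20 dB means the average pixel error stays around 10% of the full intensity range, which points to the model barely learning rather than converging slowly.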

weiMytian commented 2 years ago

Hi, I have the same problem as you. The test results are very good, but the PSNR stays at about 17 during training. Has your problem been solved?

issakh commented 2 years ago

Hi, I have not been able to find a solution to this. I tried writing the training code for this, but the same issue arose when I added the validation section. I also looked at the release notes to see the differences between torch 1.5.1 (the latest version the FLAVR code works on) and 1.6.0, and none of the new additions or deprecations explain why training no longer works on newer torch versions.

tarun005 commented 2 years ago

@issakh Can you make sure that your PyTorch version is the same as the recommended one? I did not face any such issues on my end.

issakh commented 2 years ago

Hi, the problem is that I can't use the recommended version because my GPU doesn't support the CUDA version that torch 1.5.0 requires. That is the problem I'm currently facing, and the weird thing is that I'm not sure what causes the code to break!
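When comparing a working and a non-working setup like the two described in this thread, it can help to print the exact versions involved. The following is only a sketch: `describe_environment` is a hypothetical helper, not part of FLAVR, and it relies only on standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.get_device_name`, `torch.cuda.get_device_capability`).

```python
def describe_environment():
    """Collect the PyTorch / CUDA details relevant to a version mismatch."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version the PyTorch binary was built against (None for CPU builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of GPU 0.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this on both machines and posting the output makes it easier to see whether the difference lies in the PyTorch release, the CUDA build, or the GPU itself.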