hmorimitsu / ptlflow

PyTorch Lightning Optical Flow models, scripts, and pretrained weights.
Apache License 2.0

FastFlowNet - Hard and un-reproducible convergence #40

Closed magsail closed 1 year ago

magsail commented 1 year ago

Hi Henrique,

Thank you so much for sharing this collection of optical flow models. It helped me get up to speed with these models quickly.

I've been training FastFlowNet on the FlyingChairs dataset with your default configuration. I found that convergence is hard and usually not reproducible. Sometimes the training converges after 16 epochs (45k steps), sometimes after 47 epochs (130k steps), and sometimes it does not converge at all.

I'm attaching the loss curve for convergence starting with 16 epochs and 47 epochs for example.

[Loss curve: convergence starting at 16 epochs]

[Loss curve: convergence starting at 47 epochs]

Did you see this phenomenon when you were training the model?

Besides, I compared your loss calculation with the original FastFlowNet and PWC-Net papers. In both papers, the loss at each pyramid level is multiplied by a weight from the sequence

self._weights = [0.005, 0.01, 0.02, 0.08, 0.32]

with 0.005 applied to the loss of the highest-resolution pyramid level (1/4 of the original image resolution) and 0.32 applied to the loss of the lowest-resolution level (1/64 of the original resolution).

In your implementation, you reverse the weight sequence and replace the values with a geometrically decaying sequence, i.e.

self._weights = [0.32, 0.16, 0.08, 0.04, 0.02]

So in your implementation, 0.32 applies to the highest-resolution pyramid level.

Do you have any reason for making this change? Is it because the original weight sequence is even harder to converge with?
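For clarity, here is a minimal sketch of the per-level weighting I'm describing. The helper `multiscale_loss` and the example per-level loss values are hypothetical, just to show how the two weight orderings scale each pyramid level differently; this is not the actual ptlflow loss code:

```python
def multiscale_loss(level_losses, weights):
    """Weighted sum of per-level losses.

    level_losses[0] and weights[0] correspond to the highest-resolution
    pyramid level (1/4 of the input resolution); the last entries
    correspond to the lowest-resolution level (1/64).
    """
    assert len(level_losses) == len(weights)
    return sum(w * l for w, l in zip(weights, level_losses))

# Weights as in the FastFlowNet / PWC-Net papers
# (smallest weight, 0.005, on the highest-resolution level):
paper_weights = [0.005, 0.01, 0.02, 0.08, 0.32]

# Weights as in the ptlflow implementation
# (largest weight, 0.32, on the highest-resolution level):
ptlflow_weights = [0.32, 0.16, 0.08, 0.04, 0.02]

# With equal per-level losses, the two schemes already produce
# different totals; with realistic losses (which are usually larger
# at higher resolutions), the gap is bigger still.
example_losses = [1.0, 1.0, 1.0, 1.0, 1.0]
print(multiscale_loss(example_losses, paper_weights))    # 0.435
print(multiscale_loss(example_losses, ptlflow_weights))  # 0.62
```

With the paper weights, the coarse 1/64 level dominates the gradient early in training; with the reversed weights, the fine 1/4 level dominates instead.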

I'd really appreciate it if you could advise.

Best Regards! David

hmorimitsu commented 1 year ago

Hi David, sorry to hear about your troubles.

Unfortunately, as I said in the training docs, I don't have the resources to train and verify these models myself, so I cannot guarantee they will train as intended. The default training routine is based on RAFT's, so I don't know how other models will behave with it. Based on your feedback, I think I should also make the train script print warnings, similar to the docs, to inform more people about this restriction.

As for the weights, I think FastFlowNet didn't provide a training script, so I just borrowed the loss from FlowNet and didn't realize they were different. Since you mentioned this issue, I will take a look and fix it accordingly, but I don't know whether that will solve your problem.

Best