Loss goes to nan after 6 iterations

pvnieo / SURFMNet-pytorch

A pytorch implementation of: "Unsupervised Deep Learning for Structured Shape Matching"

MIT License


Loss goes to nan after 6 iterations #4

Closed hearables-pkinsella closed 3 years ago

hearables-pkinsella commented 3 years ago

I'm following through your examples and am having issues when I start training.

It starts off fine but after 6 iterations the loss turns to nan. Do you have any guidance on what could be causing this?

pvnieo commented 3 years ago

Hi @hearables-pkinsella,

When did you clone the repo?

hearables-pkinsella commented 3 years ago

I cloned it yesterday. After reprocessing the training data, the issue with the NaNs has gone away. The loss decreases from 100 to about 9, but I had an issue where it then spiked back up to 60 after 5 epochs. I ran some tests on the results, and it's not working when compared to the TensorFlow implementation. Do you have a sample test script that you use?
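Since reprocessing the data made the NaNs go away, a finiteness check on the preprocessed features before training can catch this kind of problem early. The following is a minimal sketch, assuming the features can be iterated as rows of floats; `find_nonfinite` and the toy `features` values are hypothetical and not part of this repository.

```python
import math


def find_nonfinite(rows):
    """Return (row, column) indices of every NaN or infinite entry."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((i, j))
    return bad


# Toy stand-in for a preprocessed descriptor matrix (hypothetical values).
features = [
    [0.1, 0.2, 0.3],
    [float("nan"), 0.5, 0.6],
    [0.7, float("inf"), 0.9],
]

bad_entries = find_nonfinite(features)
print(bad_entries)  # positions to inspect before starting training
```

For data already loaded as PyTorch tensors, `torch.isfinite(x).all()` performs the same check in one call.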

pvnieo commented 3 years ago

Did you train it using the same hyperparameters and number of epochs as the TensorFlow version? How did you evaluate the TensorFlow version? Normally, this should be the same as for the PyTorch version.

Also, an unstable loss is normal for this network!

hearables-pkinsella commented 3 years ago

It's good to know that an unstable loss is normal. I tested originally with your default parameters and then with the parameters from the original paper, but the output is still the same.

I am evaluating visually by generating a P2P map and looking at the correspondences. Our database is made up of shapes with quite smooth curvature that are all in the same coordinate system. I have attached a sample of our dataset. If you have time, I would appreciate some feedback on tuning the parameters.

3105_R.zip

hearables-pkinsella commented 3 years ago

So it turns out that increasing the radius for the SHOT descriptor was the key to getting better results.

I noticed one issue that may be in my test script or in the model itself: if I set the torch model to eval mode, the output is just the same number, whereas if I set it to train mode, I get the correct correspondences out.
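One common cause of this kind of train/eval mismatch in PyTorch models is batch normalization. The thread does not confirm that this is what happens here. In train mode, a batch-norm layer normalizes with the statistics of the current batch. In eval mode, it uses running averages accumulated during training, which can be far off after few updates or with very small batches. The sketch below is a pure-Python toy, not the repository's model; `ToyBatchNorm` is a hypothetical class that only mimics the update rule and defaults of `torch.nn.BatchNorm1d` (momentum 0.1, running mean 0, running variance 1).

```python
import math


class ToyBatchNorm:
    """Toy 1-D batch normalization with separate train and eval behaviour."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            n = len(batch)
            mean = sum(batch) / n
            var = sum((x - mean) ** 2 for x in batch) / n  # biased, used to normalize
            unbiased = var * n / (n - 1) if n > 1 else var  # used for the running estimate
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased
        else:
            # Eval mode ignores the batch and relies on the running estimates.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = ToyBatchNorm()
batch = [98.0, 100.0, 102.0]  # hypothetical activations with a large mean

for _ in range(3):  # only a few training steps
    train_out = bn(batch)

bn.training = False
eval_out = bn(batch)

print("train:", train_out)  # centred on zero
print("eval: ", eval_out)   # all shifted far from zero
print("running mean:", bn.running_mean)
```

After three steps the running mean has only moved part of the way toward the true mean of 100, so the eval-mode outputs are all large and close together, while the train-mode outputs are centred on zero. If a model behaves this way, comparing each batch-norm layer's `running_mean` and `running_var` against the statistics of real batches is a reasonable first check.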

pvnieo commented 3 years ago

What batch size are you using during training?

pvnieo commented 3 years ago

Closing the issue! Feel free to reopen it if you have further questions!