rpautrat / SuperPoint

Efficient neural feature detector and descriptor

Big difference after converting model to PyTorch #316

Closed noahzn closed 9 months ago

noahzn commented 9 months ago

Hi, I used the convert_to_pytorch code to convert my TF model. The results are very different:

  1. TF version: [image]
  2. PyTorch version: [image]

I used the default parameters. Could you help me with that? Thank you. @rpautrat @sarlinpe

sarlinpe commented 9 months ago

The TF output looks very different from the one in the convert_to_pytorch.ipynb notebook - did you change any parameter there? If not, this is very surprising. I used tensorflow==1.15.0 and torch==1.13.1+cu117, what versions are you using? You might need to compare the results layer by layer to find out where the discrepancy comes from - I recommend starting from the raw score map.
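As an illustration of the layer-by-layer comparison suggested here, one option is to dump the intermediate activations of the converted PyTorch model with forward hooks and compare them against the corresponding TF tensors fetched with `sess.run` on the TF 1.15 graph. This is only a minimal sketch: `model` and `image_tensor` are assumed to come from the conversion notebook, and the forward call may need to be adapted to the model's actual input format.

```python
import torch

# Collect intermediate activations of the converted PyTorch model so that each
# stage can be compared against the matching TF tensor (fetched by name with
# sess.run on the TF 1.15 graph).
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

# `model` and `image_tensor` are assumed to exist, e.g. from the conversion
# notebook; adapt the forward call to the model's actual input format.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        module.register_forward_hook(save_activation(name))

with torch.no_grad():
    model(image_tensor)

# Print shape and magnitude of every recorded activation for a quick comparison.
for name, act in activations.items():
    print(name, tuple(act.shape), float(act.abs().max()))
```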

noahzn commented 9 months ago


Hi, this is my own model after training. I'm using TF 1.15, and the highest PyTorch version I can install is 1.10.0 with CUDA 10.2 because my Python version is 3.6. What Python version are you using?

sarlinpe commented 9 months ago

I'm using Python 3.7. What is reported in the cell checking the difference of the dense outputs?

Diff logits: 3.1471252e-05 4.6826185e-06 3.874302e-06 max/mean/median
Diff descriptors: 2.041459e-06 2.7050896e-07 2.2351742e-07 max/mean/median

You could maybe increase the detection threshold for the PyTorch model, but this would not solve the underlying issue - there must be an implementation difference somewhere.
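For reference, the max/mean/median numbers quoted above can be reproduced from two dense outputs with a small helper like the following (a sketch; the notebook's actual cell may differ):

```python
import numpy as np

def diff_stats(a, b):
    """Max / mean / median of the elementwise absolute difference."""
    d = np.abs(np.asarray(a, dtype=np.float32) - np.asarray(b, dtype=np.float32)).ravel()
    return float(d.max()), float(d.mean()), float(np.median(d))

# With the TF and PyTorch dense outputs as arrays (names here are placeholders):
# print("Diff logits:", *diff_stats(tf_logits, torch_logits), "max/mean/median")
# print("Diff descriptors:", *diff_stats(tf_desc, torch_desc), "max/mean/median")
```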

noahzn commented 9 months ago


Diff logits: 3.4570694e-05 3.1457146e-06 2.6226044e-06 max/mean/median
Diff descriptors: 7.6293945e-06 2.6236998e-07 2.0861626e-07 max/mean/median

I ran the conversion code on the official sp_v6 checkpoint, and I can output the same number of points as shown in the notebook. But the diff numbers are different:
Diff logits: 3.3140182e-05 4.75111e-06 4.053116e-06 max/mean/median
Diff descriptors: 1.9967556e-06 2.7098568e-07 2.2351742e-07 max/mean/median

sarlinpe commented 9 months ago

The diff numbers are of the same order of magnitude. The issue must be in the keypoint selection, somewhere in this section: https://github.com/rpautrat/SuperPoint/blob/d8ebb9040fac489e23dd0b6f136976c329eed3ba/superpoint_pytorch.py#L135-L157

Do you mind sharing your checkpoint?
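For context, the keypoint selection in SuperPoint-style PyTorch code is typically max-pool-based non-maximum suppression on the score map followed by thresholding. Below is a generic sketch of that technique, not necessarily line-for-line what the linked section does:

```python
import torch

def simple_nms(scores: torch.Tensor, nms_radius: int) -> torch.Tensor:
    """Suppress non-maximum scores in a (B, H, W) score map using max pooling."""
    assert nms_radius >= 0

    def max_pool(x):
        # max_pool2d pools over the last two dimensions, so (B, H, W) works as-is.
        return torch.nn.functional.max_pool2d(
            x, kernel_size=nms_radius * 2 + 1, stride=1, padding=nms_radius)

    zeros = torch.zeros_like(scores)
    max_mask = scores == max_pool(scores)
    # Two refinement rounds recover points that were only dominated by a
    # stronger neighbour which itself got suppressed.
    for _ in range(2):
        supp_mask = max_pool(max_mask.float()) > 0
        supp_scores = torch.where(supp_mask, zeros, scores)
        new_max_mask = supp_scores == max_pool(supp_scores)
        max_mask = max_mask | (new_max_mask & ~supp_mask)
    return torch.where(max_mask, scores, zeros)

# Keypoints are then the locations whose suppressed score exceeds the threshold:
# nms_scores = simple_nms(scores, nms_radius=4)
# keypoints = torch.nonzero(nms_scores[0] > detection_threshold)  # (N, 2) as (y, x)
```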

noahzn commented 9 months ago

model.zip


Here is my checkpoint.

sarlinpe commented 9 months ago

Thanks. I found the issue and fixed it in PR https://github.com/rpautrat/SuperPoint/pull/317. I suggest increasing your detection threshold to 0.01; the results will then look closer to those of the model we trained.
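To apply the threshold suggestion, the idea is simply to raise the score cutoff used after NMS when instantiating the converted model. A sketch assuming the class in superpoint_pytorch.py accepts a detection_threshold option like glue-factory's open SuperPoint; check the actual configuration key and forward signature in that file:

```python
import torch
from superpoint_pytorch import SuperPoint  # model definition from this repository

# Assumption: the constructor takes a `detection_threshold` keyword, as in
# glue-factory's open SuperPoint; verify the exact key in superpoint_pytorch.py.
model = SuperPoint(detection_threshold=0.01).eval()
# Hypothetical path to the weights produced by convert_to_pytorch.ipynb.
model.load_state_dict(torch.load("superpoint_converted.pth"))

with torch.no_grad():
    # `image_tensor` is assumed to be a (1, 1, H, W) grayscale image in [0, 1];
    # the dict input mirrors glue-factory-style models, adapt if needed.
    pred = model({"image": image_tensor})
```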

noahzn commented 9 months ago


Thank you very much for the fix! The results look much closer now. Since I am going to use this trained model with LightGlue, I am wondering whether glue-factory's open SuperPoint code has a similar NMS problem. I mean, if I use the converted PyTorch model that I generated yesterday, will it affect the training of LightGlue, since LightGlue uses the model's output as input?

sarlinpe commented 9 months ago

No, your converted model and glue-factory's open SuperPoint are fine; the problem was in the TensorFlow inference model.
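As background to the question above about LightGlue consuming the detector's output, this is roughly how the standalone cvg/LightGlue inference package pairs an extractor with the matcher (shown with its bundled SuperPoint; glue-factory's training pipeline wires this up differently):

```python
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# Detector/descriptor and matcher; LightGlue takes whatever the extractor
# produces (keypoints, descriptors, scores) as its input.
extractor = SuperPoint(max_num_keypoints=2048).eval()
matcher = LightGlue(features="superpoint").eval()

image0 = load_image("image0.jpg")  # hypothetical image paths
image1 = load_image("image1.jpg")

with torch.no_grad():
    feats0 = extractor.extract(image0)
    feats1 = extractor.extract(image1)
    matches01 = matcher({"image0": feats0, "image1": feats1})

# Remove the batch dimension and read out the matched keypoint indices.
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]
matches = matches01["matches"]  # (K, 2) indices into feats0/feats1 keypoints
```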