naver / mast3r

Grounding Image Matching in 3D with MASt3R

[QUESTION] Inference with an image shape of 256 #35

Open yukiumi13 opened 2 months ago

yukiumi13 commented 2 months ago

Thank you for your excellent work!

I am trying to integrate mast3r into some large 3D networks. Due to memory limitations, I can only input images of up to 256x256. I would therefore like to know whether it is correct to directly change 'true_shape' in the input dict to (256, 256) and feed images of shape (b, 3, 256, 256) to the published 512-resolution pretrained weights. Doing so did not raise any explicit errors, but the number of matches found decreased dramatically.
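For reference, here is a minimal sketch of how such a 256x256 view dict could be assembled, mirroring the fields that dust3r's `load_images` produces (`img`, `true_shape`, `idx`, `instance`); the `make_view` helper is hypothetical, and the commented-out line stands in for the actual model forward pass:

```python
import torch
import numpy as np

def make_view(img_256, idx):
    """Hypothetical helper: wrap a (3, 256, 256) tensor (normalized to
    [-1, 1]) into the view-dict format that dust3r/mast3r inference expects."""
    return {
        "img": img_256.unsqueeze(0),                         # (1, 3, 256, 256)
        "true_shape": np.int32([list(img_256.shape[-2:])]),  # [[256, 256]]
        "idx": idx,
        "instance": str(idx),
    }

view1 = make_view(torch.zeros(3, 256, 256), 0)
view2 = make_view(torch.zeros(3, 256, 256), 1)
# out1, out2 = model(view1, view2)  # model = the pretrained MASt3R checkpoint
```

Since the published checkpoint was trained at 512 resolution, the forward pass runs without errors at 256, but there is no guarantee the predictions are as good.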

ljjTYJR commented 2 months ago

Hi, when you say "doing so did not result in any explicit errors", do you mean the predicted point map?

yukiumi13 commented 2 months ago

Hi. Yes, mast3r can output pts3d with shape bx256x256x3, since it adjusts the patch grid automatically. I found that the low performance may be caused by the simple NN search method I used in my own code, which does not iterate. I'm refactoring the fast_nn_reciprocal method in mast3r to support batched inputs and be fully torch-based, enabling backpropagation.
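For anyone interested, a simplified batched reciprocal-NN check can be written in pure torch along these lines. This is a sketch, not the actual fast_nn_reciprocal implementation: it does a single dense pass and omits mast3r's subsampled iteration scheme:

```python
import torch

def batched_reciprocal_nn(desc1, desc2):
    """Batched mutual nearest-neighbor matching in pure torch.

    desc1: (B, N, D), desc2: (B, M, D) L2-normalized descriptors.
    Returns, per batch item, (indices into view 1, indices into view 2)
    for the mutually-nearest pairs.
    """
    sim = desc1 @ desc2.transpose(1, 2)      # (B, N, M) cosine similarity
    nn12 = sim.argmax(dim=2)                 # (B, N) best match in 2 for each in 1
    nn21 = sim.argmax(dim=1)                 # (B, M) best match in 1 for each in 2
    ids1 = torch.arange(desc1.shape[1], device=desc1.device)
    mutual = nn21.gather(1, nn12) == ids1    # (B, N) reciprocity check
    return [(ids1[m], nn12[b][m]) for b, m in enumerate(mutual)]
```

The dense (B, N, M) similarity matrix is memory-hungry for full-resolution feature maps, which is exactly why the original implementation iterates over subsampled candidates instead.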

yukiumi13 commented 2 months ago

After implementing the fast_nn method described in MASt3R, we observed a significant increase in correspondences. Additionally, based on some papers I read, it is recommended to use inputs with the shape (B, 3, 224, 224) during training to prevent disturbances in the PE of the ViT.

[Screenshot 2024-09-02: matching results]
ljjTYJR commented 2 months ago

Which paper mentions the disturbances of the PE? Will it also affect RoPE?

yukiumi13 commented 2 months ago

Hello. From my perspective, no paper in 3D vision has discussed this issue specifically, but some video-generation papers using ViTs have examined it. For example, the position-code extrapolation analysis in CogVideo showed degraded generation quality when the inference resolution is changed directly.

rwn17 commented 2 months ago

> After implementing the fast_nn method described in MASt3R, we observed a significant increase in correspondences. Additionally, based on some papers I read, it is recommended to use inputs with the shape (B, 3, 224, 224) during training to prevent disturbances in the PE of the ViT.

I also observed a similar performance drop at 256x256 resolution, and converting images to 224x224 makes dust3r happy. @yukiumi13 Do you think the strategy in https://github.com/naver/dust3r/issues/62 may help?
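For what it's worth, the 256-to-224 conversion can be done with a plain bicubic resize before inference; the `to_224` helper below is just an illustrative sketch (and `true_shape` in the input dict would need updating to match):

```python
import torch
import torch.nn.functional as F

def to_224(img):
    """Hypothetical preprocessing step: resize a (B, 3, H, W) batch to
    224x224, the resolution the ViT's position embeddings were trained at,
    instead of feeding 256x256 directly."""
    return F.interpolate(img, size=(224, 224), mode="bicubic", align_corners=False)

batch = torch.rand(2, 3, 256, 256)
resized = to_224(batch)
```

Cropping to 224x224 instead of resizing is the other option, at the cost of losing the image borders.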