naver / mast3r

Grounding Image Matching in 3D with MASt3R

[QUESTION] of inference with the image shape of 256 #35

Open yukiumi13 opened 2 weeks ago

yukiumi13 commented 2 weeks ago

Thank you for your fancy work!

I am trying to integrate MASt3R into some large 3D networks. Due to memory limitations, I can only input images of up to 256x256. Therefore, I would like to know whether it is correct to directly change 'true_shape' in the input dict to (256, 256) and feed images of shape (b, 3, 256, 256) to the published 512x512 pretrained weights. Doing so did not raise any explicit errors, but the number of matches found decreased dramatically.
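For concreteness, this is roughly what the modified view dict would look like. The field names ('img', 'true_shape', 'idx', 'instance') follow the style of dust3r's demo loaders; this is a sketch, so verify the exact keys and normalisation against the version of the code you are integrating:

```python
import torch

# Hypothetical 256x256 view dict in dust3r/mast3r style (field names
# assumed from the demo loaders; check against your code version).
H = W = 256
view = {
    'img': torch.randn(1, 3, H, W),                           # normalised image batch
    'true_shape': torch.tensor([[H, W]], dtype=torch.int32),  # shape before any resize
    'idx': 0,
    'instance': '0',
}
# The image tensor and the declared true_shape should agree.
assert view['img'].shape[-2:] == tuple(view['true_shape'][0].tolist())
```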

ljjTYJR commented 2 weeks ago

Hi, when you say "doing so did not result in any explicit errors", do you mean the predicted point map?

yukiumi13 commented 2 weeks ago

Hi. Yes, MASt3R can output pts3d of shape bx256x256x3, since it adjusts the patch grid automatically. I found that the low performance may be caused by a simple NN search method I used in my own code, without iteration. I'm refactoring the fast_nn_reciprocal method in MASt3R to support batched inputs and to be fully torch-based, enabling backpropagation.
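As a reference point for what a batched, torch-only mutual nearest-neighbour match looks like, here is a minimal sketch. This is not MASt3R's fast_nn_reciprocal itself (which uses a block-wise search to save memory); it is the naive dense version, assuming L2-normalised descriptors:

```python
import torch

def reciprocal_nn(desc1, desc2):
    """Batched mutual (reciprocal) nearest-neighbour matching.

    desc1: (B, N, D), desc2: (B, M, D), L2-normalised descriptors.
    Returns indices (B, N) into desc2 and a (B, N) bool mask that is
    True where the match is mutual. Naive dense sketch, not MASt3R's
    memory-efficient fast_nn implementation.
    """
    sim = torch.einsum('bnd,bmd->bnm', desc1, desc2)   # cosine similarity matrix
    nn12 = sim.argmax(dim=2)                           # best desc2 index per desc1
    nn21 = sim.argmax(dim=1)                           # best desc1 index per desc2
    idx = torch.arange(desc1.shape[1], device=desc1.device)
    # Cycle-consistency: i -> nn12[i] -> back to i means a mutual match.
    mutual = nn21.gather(1, nn12) == idx.unsqueeze(0)
    return nn12, mutual
```

Note that argmax itself is not differentiable; gradients flow through whatever loss you build on the matched descriptors or similarities, not through the index selection.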

yukiumi13 commented 1 week ago

After implementing the fast_nn method described in MASt3R, we observed a significant increase in correspondences. Additionally, based on some papers I have read, it is recommended to use inputs of shape (B, 3, 224, 224) during training to avoid disturbing the PE of the ViT.

ljjTYJR commented 1 week ago

Which paper mentions the disturbance of the PE? Will it also affect RoPE?

yukiumi13 commented 1 week ago

Hello. From my perspective, no paper on 3D vision has discussed this issue specifically, but some papers on video generation with ViTs have examined it. For example, the position-code extrapolation experiments in CogVideo show degraded generation quality when the inference resolution is changed directly.

rwn17 commented 1 week ago

> After implementing the fast_nn method described in MASt3R, we observed a significant increase in correspondences. Additionally, based on some papers I read, it is recommended to use inputs with the shape (B, 3, 224, 224) during training to prevent disturbances in the PE of the ViT.

I also observed a similar performance drop at 256x256 resolution, and converting images to 224x224 makes dust3r happy. @yukiumi13 Do you think the strategy in https://github.com/naver/dust3r/issues/62 may help?
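If downscaling to the 224x224 training resolution is the workaround, the preprocessing step is a one-liner; a hedged sketch using plain torch (the actual dust3r loaders also normalise and may crop, so treat this only as the resize part):

```python
import torch
import torch.nn.functional as F

# Hypothetical preprocessing: downscale a 256x256 batch to the
# 224x224 resolution before feeding the network. Normalisation and
# any cropping done by the real loaders are omitted here.
imgs = torch.randn(2, 3, 256, 256)
imgs_224 = F.interpolate(imgs, size=(224, 224), mode='bilinear',
                         align_corners=False)
assert imgs_224.shape == (2, 3, 224, 224)
```

Remember to set 'true_shape' to (224, 224) as well so it stays consistent with the resized tensor.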