Dear authors: a confusion about network architecture.

yuedajiong commented 3 months ago

Dear author, if there are no inherent constraints on the input images (such as B must always be to the left of A or even stricter constraints), what is the reason behind TransformerDecoder_1/2 and Header_1/2 being required to have different weights instead of sharing weights for later information sharing? According to my limited understanding, after 'perfect' or 'sufficient' training, Decoder_1/2 and Header_1/2 should be nearly identical. In this case, what is the significance of not sharing weights?

In short: if there are no inherent differences, then after sufficient training Decoder_1/2 and Header_1/2 should be nearly identical. Consider this: feed the same single image I into both paths of this network, and the two paths output different pointmaps and camera poses. Is that reasonable? Is that what we want?

A thought experiment: swap the input images A and B, and we may get different output performance, worse or better. Right?

vincent-leroy commented 3 months ago

The pointmaps to predict are both expressed in the reference frame of camera 1. If Decoder 2 was the same as Decoder 1, it would thus predict pointmaps in the reference frame of the input view (camera 2), which would mean both predictions do not live in the same coordinate system.
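
To make the coordinate-system point concrete, here is a minimal NumPy sketch. It is not DUSt3R code: the relative pose (`R_2_to_1`, `t_2_to_1`) and the sample points are made-up values chosen only for illustration. It shows that the same 3D points have different coordinates in camera 2's frame and in camera 1's frame, so a branch that answered in its own camera frame would not agree with one that answers in camera 1's frame.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation matrix about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# Hypothetical relative pose mapping camera-2 coordinates to camera-1 coordinates.
R_2_to_1 = rotation_y(np.deg2rad(30.0))
t_2_to_1 = np.array([0.5, 0.0, 0.0])

# A few 3D points expressed in camera 2's own frame (illustrative values).
points_cam2 = np.array([[0.0, 0.0, 2.0],
                        [0.3, -0.1, 2.5]])

# The same physical points expressed in camera 1's frame: x1 = R @ x2 + t.
points_in_cam1 = points_cam2 @ R_2_to_1.T + t_2_to_1

# The two descriptions differ unless the relative pose is the identity.
frames_differ = not np.allclose(points_cam2, points_in_cam1)

# Inverting the rigid transform recovers the camera-2 coordinates.
back_in_cam2 = (points_in_cam1 - t_2_to_1) @ R_2_to_1

print("camera-2 frame:\n", points_cam2)
print("camera-1 frame:\n", points_in_cam1)
print("frames differ:", frames_differ)
```

Under this reading, the second branch has to learn the extra job of expressing view 2's points in camera 1's frame, which is a different mapping from the one the first branch learns.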

yuedajiong commented 3 months ago

I feel that I haven't grasped the essence and subtlety of your design. According to my limited understanding, if the functions of the two paths are similar, then the difference in output should be solely due to the differences in input, rather than an apparent error caused by network parameters.
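
One way to look at this with the swap experiment from earlier in the thread is the toy sketch below. It is only an illustration under strong assumptions: `shared_branch` is a hypothetical stand-in for a decoder plus head with a single set of weights that sees its own view's features and the other view's features; it is not the DUSt3R architecture. With shared weights, the branch handling image B computes exactly the same thing whether B is the second image of the pair (A, B) or the first image of the pair (B, A), although the desired outputs live in different frames (A's frame in the first case, B's own frame in the second).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared weights: one set used by both branches.
W_own = rng.normal(size=(3, 4))
W_other = rng.normal(size=(3, 4))

def shared_branch(own_feat, other_feat):
    """Toy stand-in for a decoder + head that shares weights across branches."""
    return W_own @ own_feat + W_other @ other_feat

# Made-up feature vectors for the two images.
feat_a = rng.normal(size=4)
feat_b = rng.normal(size=4)

# Pair (A, B): branch 2 processes B while looking at A.
branch2_on_ab = shared_branch(feat_b, feat_a)

# Pair (B, A): branch 1 processes B while looking at A.
branch1_on_ba = shared_branch(feat_b, feat_a)

# With shared weights the two calls are the same computation,
# so the branch cannot tell which role (reference or not) image B plays.
same_output = np.allclose(branch2_on_ab, branch1_on_ba)
print("identical outputs for both roles:", same_output)
```

In this toy setting the role of each image is not visible in the inputs alone, so the difference in the targets has to come from somewhere else, for example from separate weights per branch.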

yuedajiong commented 3 months ago

Thanks!

naver/dust3r, issue #38