lpiccinelli-eth / UniDepth

Universal Monocular Metric Depth Estimation

Question on Depth Module #6

Closed by Aurelien-VB 3 months ago

Aurelien-VB commented 3 months ago

Thank you for this work! I have trouble understanding the Depth Module, especially why the keys and values are taken from the camera embedding $E_1$ rather than from the initial depth features $D$ coming from the encoder. Doesn't this cause the model to lose the encoder's information?

lpiccinelli-eth commented 3 months ago

Thank you for asking the question.

If I understand correctly, the confusion probably comes from the fact that there is a residual (skip) connection around the cross-attention between the depth features and the camera embeddings. The depth features are therefore not replaced; they are only "corrected" by the camera embeddings, namely $D = D + \mathrm{CrossAttn}(D, E)$. The encoder information is carried through the skip path.
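
As a rough illustration of the residual update described above, here is a minimal single-head sketch in plain Python. It is not the UniDepth implementation: the function names (`cross_attn`, `depth_update`) are hypothetical, and learned query/key/value projections, multiple heads, and normalization are all omitted (identity projections are assumed), so that only the skip connection is visible.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attn(queries, context):
    """Single-head cross-attention with identity projections (an assumption).

    Queries come from `queries` (the depth features); keys and values
    both come from `context` (the camera embeddings).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted sum of the context rows, which act as the values.
        out.append([
            sum(w * v[j] for w, v in zip(weights, context))
            for j in range(len(context[0]))
        ])
    return out


def depth_update(D, E):
    """Residual update D + CrossAttn(D, E): D is corrected, not replaced."""
    A = cross_attn(D, E)
    return [[d + a for d, a in zip(d_row, a_row)] for d_row, a_row in zip(D, A)]


if __name__ == "__main__":
    D = [[1.0, 2.0], [3.0, 4.0]]      # toy depth features (2 tokens, dim 2)
    E_zero = [[0.0, 0.0], [0.0, 0.0]]  # toy camera embeddings, all zeros
    # With a zero correction, the encoder features pass through unchanged.
    print(depth_update(D, E_zero) == D)
```

In this toy setting, a zero camera embedding gives a zero attention output, so `depth_update` returns `D` unchanged. Without the skip connection the output would be built only from the rows of `E`, which is the information loss the question is worried about.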

Aurelien-VB commented 3 months ago

Thanks for the quick answer, it makes sense!