Closed Aurelien-VB closed 3 months ago

Thank you for this work! I have trouble understanding the Depth Module, and especially why the Keys and Values are taken from the camera embedding $E_1$ rather than from the initial depth features $D$ coming from the encoder. Doesn't this cause the model to lose the encoder's information?

Thank you for asking the question.
If I understand correctly, this possible misunderstanding comes from the fact that there is actually a residual (skip) connection in the cross-attention between the depth features and the camera embeddings. This means that the depth features are only "corrected" by the camera embeddings, namely $D = D + \mathrm{CrossAttn}(D, E)$.

Thanks for the quick answer, it makes sense!
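For anyone else landing here, the residual correction $D = D + \mathrm{CrossAttn}(D, E)$ discussed above can be sketched in a few lines of NumPy. This is only an illustrative single-head sketch, not the actual implementation: the token counts `N`, `M`, the channel width `C`, and the random projection matrices are all made-up assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: N depth tokens, M camera-embedding tokens, C channels.
rng = np.random.default_rng(0)
N, M, C = 16, 4, 8
D = rng.standard_normal((N, C))   # depth features from the encoder
E = rng.standard_normal((M, C))   # camera embeddings

# Single-head cross-attention: Queries come from D, Keys/Values from E.
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
Q, K, V = D @ Wq, E @ Wk, E @ Wv
correction = softmax(Q @ K.T / np.sqrt(C)) @ V   # (N, C) correction term

# Residual (skip) connection: D is only "corrected" by the camera
# embeddings, so the encoder information carried by D is preserved.
D_out = D + correction
print(D_out.shape)  # (16, 8)
```

Because the attention output is added to `D` rather than replacing it, the encoder features pass through unchanged along the skip path, which is why the Keys/Values can come solely from $E_1$ without discarding the encoder's information.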