noahzn / Lite-Mono

[CVPR2023] Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
MIT License

About the CDC module and LGFI module #131

Closed: QingTianNNN closed this issue 6 months ago

QingTianNNN commented 7 months ago

Thank you for your hard work; it helps me a lot. I am a beginner in the field of computer vision and have some questions about the CDC module and the LGFI module that I hope you can answer. Why is batch normalization applied after the 3×3 DDWConv instead of after the 1×1 point-wise convolutions? In the LGFI module, why is layer normalization performed first, and how do local features and global features interact?

noahzn commented 7 months ago

Hi! CDC: We only use BN on the features after the dilated conv (feature extraction). The two 1x1 point-wise convolutions here are used to enhance and aggregate the local information, and we didn't consider putting BN after them.
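
To make the ordering concrete, here is a minimal sketch of a CDC block as described above: 3×3 dilated depth-wise conv, then BN, then the two 1×1 point-wise convolutions with a GELU in between, plus a residual connection. The class name, the expansion ratio of 4, and the use of nn.Conv2d for the point-wise layers are illustrative assumptions, not the repository's exact code.

```python
import torch.nn as nn

class CDCBlockSketch(nn.Module):
    """Sketch of the described order: dilated depth-wise conv -> BN ->
    1x1 pw conv -> GELU -> 1x1 pw conv, with a residual connection."""
    def __init__(self, dim, dilation=1, expansion=4):
        super().__init__()
        # 3x3 dilated depth-wise convolution (feature extraction)
        self.ddwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=dilation,
                                 dilation=dilation, groups=dim)
        # BN is applied right after the dilated conv, not after the 1x1 convs
        self.bn = nn.BatchNorm2d(dim)
        # two 1x1 point-wise convolutions enhance/aggregate the local information
        self.pwconv1 = nn.Conv2d(dim, dim * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(dim * expansion, dim, kernel_size=1)

    def forward(self, x):            # x: (B, C, H, W)
        residual = x
        x = self.bn(self.ddwconv(x))
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return x + residual
```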

LGFI: 1) In the original Vision Transformer paper, the norm layer is also put in front of the attention layers to normalize the input. 2) For local and global feature interactions, the input feature is added to the intermediate outputs.
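
As an illustration of that ordering, here is a minimal pre-norm residual wrapper; the name PreNormResidual and the generic `attention` argument are made up for this sketch and are not the repository's code.

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """LayerNorm normalizes the input before attention (pre-norm, as in ViT),
    and the residual addition lets the local input features interact with the
    globally attended features."""
    def __init__(self, dim, attention):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attention = attention   # any module mapping (B, N, C) -> (B, N, C)

    def forward(self, x):            # x: (B, N, C) tokens
        return x + self.attention(self.norm(x))   # input added to attended output
```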

QingTianNNN commented 6 months ago

Thank you for answering me. Are the two 1×1 point-wise convolutions in CDC and LGFI both used to enhance and aggregate the local information? Can I regard the three operations, 1×1 point-wise convolution, GELU, and 1×1 point-wise convolution, as a fully connected layer? Not only can features be enhanced and aggregated, but the number of output channels can also be changed.

noahzn commented 6 months ago

In our implementation we use nn.Linear because it is equivalent to nn.Conv2d when the kernel size is 1. We expand the intermediate channel numbers, but the final output channels remain the same as the input.
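
As a quick check (not taken from the repository), the following shows that nn.Linear applied to a channels-last tensor gives the same result as a 1×1 nn.Conv2d with the same weights:

```python
import torch
import torch.nn as nn

B, C_in, C_out, H, W = 2, 16, 64, 8, 8
x = torch.randn(B, C_in, H, W)

conv = nn.Conv2d(C_in, C_out, kernel_size=1)
linear = nn.Linear(C_in, C_out)
# copy the conv weights into the linear layer
linear.weight.data = conv.weight.data.view(C_out, C_in)
linear.bias.data = conv.bias.data

out_conv = conv(x)                           # (B, C_out, H, W)
out_linear = linear(x.permute(0, 2, 3, 1))   # channels-last input: (B, H, W, C_out)
print(torch.allclose(out_conv, out_linear.permute(0, 3, 1, 2), atol=1e-5))  # True
```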

QingTianNNN commented 6 months ago

Q1. Are the two 1×1 point-wise convolutions in CDC and LGFI both used to enhance and aggregate the local information? Q2. In the LGFI module, is the role of cross-covariance attention to attend over the channels of the CDC module's output? Thank you for taking the time to answer me.

noahzn commented 6 months ago
  1. For CDC, it is. For LGFI, I would not use the word "local information". This module is for local/global information fusion.
  2. Yes, it's an attention mechanism focusing on the correlations between channels.
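
For reference, a simplified sketch of cross-covariance attention (in the spirit of XCiT) that makes the channel-by-channel attention map explicit; it uses a single head and omits the learnable temperature, so it is an illustration rather than the repository's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCASketch(nn.Module):
    """Cross-covariance attention sketch: the attention map is C x C, i.e. it
    models correlations between channels rather than between spatial tokens."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C) tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each (B, N, C)
        q = F.normalize(q.transpose(1, 2), dim=-1)          # (B, C, N)
        k = F.normalize(k.transpose(1, 2), dim=-1)          # (B, C, N)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)    # (B, C, C) channel map
        out = attn @ v.transpose(1, 2)                       # (B, C, N)
        return self.proj(out.transpose(1, 2))                # back to (B, N, C)
```

This module takes and returns tokens of shape (B, N, C), so it could be plugged into the pre-norm residual wrapper sketched earlier in this thread.
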
noahzn commented 6 months ago

I am now closing this thread due to lack of response. You can reopen it or create a new issue if you have further questions.

LLLYLong commented 6 months ago

@noahzn Hello, I would like to ask a question about splitting the KITTI dataset. Is there a proper way to select a small portion of it to test the performance of the network, not for the final comparison, but only for checking whether the current network is effective? Due to the performance limitations of my graphics card it would take too long to run on the full dataset, so if you have made any similar changes, I hope you can help me!

noahzn commented 6 months ago

@LLLYLong Hi, please open a new ticket for your new question. It won't take a long time to evaluate on the whole evaluation set. But if you are asking about using a small portion of images for training, I would say that it would definitely affect the results.
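
For anyone who only wants a quick sanity check, here is a hypothetical helper (not part of this repository) that subsamples a monodepth2-style split file. The split paths are assumptions based on the usual splits/&lt;name&gt;/train_files.txt layout, and as noted above, training on such a subset will affect the reported results.

```python
import os
import random

def subsample_split(src="splits/eigen_zhou/train_files.txt",
                    dst="splits/eigen_zhou_small/train_files.txt",
                    keep_every=10, shuffle_seed=42):
    """Write a smaller split file containing roughly 1/keep_every of the samples."""
    with open(src) as f:
        lines = [l.strip() for l in f if l.strip()]
    random.seed(shuffle_seed)
    random.shuffle(lines)          # avoid keeping frames from only a few drives
    subset = lines[::keep_every]
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    with open(dst, "w") as f:
        f.write("\n".join(subset) + "\n")
    print(f"kept {len(subset)} of {len(lines)} samples")
```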