lhoyer / MIC

[CVPR23] Official Implementation of MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation

Consistency Loss not used for detection #49

Closed ShenZheng2000 closed 1 year ago

ShenZheng2000 commented 1 year ago

In the object detection code, the original call to the consistency_loss function is replaced with an L1 loss computation. The rationale behind this change is not clear from the repository.

In addition, enforcing consistency between the detection results for a target image and its masked version seems counterintuitive. Detection of a completely masked object is expected to fail, which would make consistency between the two views hard to achieve.

Could you explain it a little bit? Thanks!

krumo commented 1 year ago

Hi, thanks for your question! Regarding the call to the consistency loss, our implementation is based on the SADA framework instead of the earlier da-faster framework, and the consistency loss is implemented slightly differently in the two. The original call is applied to da_img_consist and da_ins_consist, while the new L1 loss is applied to da_img_rois_probs_pool and da_ins_consist. To align the image-level features with da_ins_consist, the original call computes the average of da_img_consist over all spatial locations, whereas the new implementation uses the ROI-pooled da_img_consist. In practice, the latter should be more stable than the former.
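For illustration, here is a minimal PyTorch sketch of the two variants. The tensor names (img_domain_probs, ins_domain_probs, rois) and the spatial_scale value are placeholders rather than the actual SADA/da-faster identifiers; the point is only the difference between averaging the image-level domain map over all spatial locations and ROI-pooling it at the instance boxes before taking the L1 difference with the instance-level predictions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def consistency_spatial_mean(img_domain_probs, ins_domain_probs):
    """da-faster-style variant: average the image-level domain map over all
    spatial locations and compare it with every instance prediction via L1.
    Assumes a single image per batch to keep the broadcasting simple."""
    # img_domain_probs: (1, 1, H, W); ins_domain_probs: (R, 1)
    img_avg = img_domain_probs.mean(dim=(2, 3))                   # (1, 1)
    return F.l1_loss(img_avg.expand_as(ins_domain_probs), ins_domain_probs)

def consistency_roi_pooled(img_domain_probs, ins_domain_probs, rois,
                           spatial_scale=1.0 / 16):
    """SADA-style variant: ROI-pool the image-level domain map at each
    instance box so the image-level signal is spatially aligned with the
    instance before taking the L1 difference."""
    # rois: (R, 5) = (batch_idx, x1, y1, x2, y2) in image coordinates
    pooled = roi_align(img_domain_probs, rois, output_size=(1, 1),
                       spatial_scale=spatial_scale)               # (R, 1, 1, 1)
    return F.l1_loss(pooled.flatten(1), ins_domain_probs)
```

Intuitively, ROI-pooling ties the image-level signal to the same regions that produced the instance predictions, instead of comparing every instance against a single global average, which is consistent with the claim above that it is more stable in practice.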

For applying MIC to object detection, we mask the target images randomly, so both foreground objects and background regions can be masked. Enforcing consistency between the student predictions and the pseudo labels produced by the teacher alleviates misclassification of background regions and improves the recognition of occluded foreground objects. Since the masking is random, there is a chance that some foreground objects are completely masked. However, the same risk exists in the semantic segmentation and image classification tasks. The experiments and analysis in the paper suggest that the enhanced context learning outweighs the possible loss from completely masking objects.
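As a rough sketch of the masking and consistency step being described (the patch size, mask ratio, and function names are illustrative, not the repository's actual configuration):

```python
import math
import torch

def random_patch_mask(image, patch_size=64, mask_ratio=0.5):
    """Zero out a random subset of patches of a (C, H, W) image, in the
    spirit of MIC's target-image masking (values here are illustrative)."""
    _, h, w = image.shape
    gh, gw = math.ceil(h / patch_size), math.ceil(w / patch_size)
    keep = (torch.rand(gh, gw) > mask_ratio).float()
    mask = keep.repeat_interleave(patch_size, 0).repeat_interleave(patch_size, 1)
    return image * mask[:h, :w]

# Conceptual training step (detectors and the loss are placeholders):
# pseudo_labels = teacher_detector(target_image)   # teacher sees the full image
# masked_image  = random_patch_mask(target_image)  # student sees the masked one
# student_preds = student_detector(masked_image)
# loss_mic      = detection_loss(student_preds, pseudo_labels)
```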