Closed HuizaiVictorYao closed 6 months ago
Hi @VictorYrotciV thanks very much for your interest and insightful comment!
You're right, this is a great point. Our proposed protocol in Sec 6.1 of the paper states: "any technique that does not need target-domain data to run should also be used by source-only and oracle models". So, following our own protocol, it is indeed inappropriate to compare SADA (which does not use EMA/strong augs) with a baseline that does include EMA/strong augs.
We made the decision to only show a single source-only/oracle model for the comparisons in Figures 1 and 5 because it would be cumbersome to show a different source-only/oracle model for each method, so we picked the most representative one. We will add a clarification around this. You could also argue that there should be a different source-only/oracle model for every DAOD method: one which adopts the same exact augmentation strategy of the DAOD method (we also plan to add this comparison to the supplemental material for completeness). We think this illustrates just how difficult it can be to fairly compare methods in this area, since there is overlap between domain generalization and domain adaptation methods.
So, for fairly evaluating the performance of SADA (and other future non-MT methods), there seem to be two options:

1. Upgrade the non-MT method with the same EMA and strong augmentations used by our source-only/oracle models (e.g. a "SADA++" variant).
2. Downgrade the source-only/oracle models by removing EMA and strong augmentations to match the non-MT method.
We believe choice 1 is the appropriate way forward for the same reason that we upgrade all architectures to stronger modern backbones -- we know that EMA and strong augs improve OOD performance overall, and as we see from our experiments, the relative strength of methods can change considerably when backbones and training settings are upgraded. Thus it will be most informative for the community to perform comparisons in a modern context rather than downgrading our source-only/oracle models.
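For readers less familiar with the EMA component being discussed, here is a minimal sketch of a mean-teacher-style EMA weight update. The `ema_update` helper and the decay value of 0.999 are illustrative assumptions (real implementations update model parameter tensors in place after each optimizer step), not code from the paper:

```python
# Minimal sketch of an EMA (mean teacher) weight update.
# Plain dicts of floats stand in for model weights; the decay
# value 0.999 is a common choice, not one taken from the paper.

def ema_update(teacher, student, decay=0.999):
    """Move teacher weights a small step toward the student weights."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Usage: after each optimizer step on the student, refresh the teacher.
teacher = {"w": 1.0}
student = {"w": 2.0}
teacher = ema_update(teacher, student)
print(round(teacher["w"], 3))  # 1.001
```

The teacher therefore changes slowly and smoothly, which is why it tends to produce more stable pseudo-labels than the raw student.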
We have performed experiments for both of these settings, and will include them in an update to the paper. Thanks very much for your engagement and suggestions!
"SADA++" experiments: Cityscapes to Foggy Cityscapes Source-only: 59.1 SADA++: 61.7 Oracle: 67.2
Weaker baseline experiments: Cityscapes to Foggy Cityscapes Source-only (no EMA or strong augs): 51.9 SADA: 54.2 Oracle (no EMA or strong augs): 64.6
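One way to read these numbers side by side is to ask what fraction of the source-only-to-oracle gap each variant closes. This is just a quick illustrative calculation on the values reported above, not a metric from the paper:

```python
# Fraction of the source-only -> oracle gap closed by adaptation,
# using the mAP numbers reported in the comment above.

def gap_closed(source_only, adapted, oracle):
    """Return the fraction of the source-to-oracle gap that adaptation closes."""
    return (adapted - source_only) / (oracle - source_only)

# "SADA++" setting (with EMA + strong augs)
print(round(gap_closed(59.1, 61.7, 67.2), 3))  # 0.321

# Weaker-baseline setting (no EMA or strong augs)
print(round(gap_closed(51.9, 54.2, 64.6), 3))  # 0.181
```

Note that the normalizing gap itself differs between the two settings, which is part of why comparing raw deltas across protocols can be misleading.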
Thanks for your prompt reply!
I agree that it's troublesome to construct a different source-only/oracle for every DAOD method with a potentially different augmentation strategy. Adding strong augmentations and EMA to the non-MT-based methods seems to be the most straightforward and effective solution to the problem, and I believe the comparisons will be fair once these additional results are included.
Appreciate your excellent work, and also appreciate your contribution to the community!
btw, there seem to be some citation mistakes in your arXiv preprint. For example, on page 4: "Adaptive Teacher (AT) [55] uses mean teacher with an image-level discriminator network." Here [55] points to "Xue, Z., Yang, F., Rajaraman, S., Zamzmi, G., Antani, S.: Cross dataset analysis of domain shift in CXR lung region detection. Diagnostics 13(6), 1068 (2023)", not to Adaptive Teacher (Li, Y.J., Dai, X., Ma, C.Y., Liu, Y.C., Chen, K., Wu, B., He, Z., Kitani, K., Vajda, P.: Cross-domain adaptive teacher for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7581-7590 (2022)).
Nice catch on the citation mistake, thanks @VictorYrotciV will update that
Hi @justinkay,
Thanks for your brilliant work. Unifying DAOD with consistent backbone initialization and developing more reliable evaluation metrics is indeed a wonderful idea, and a great contribution to the community, as everyone working on DAOD has likely suffered from unreasonable or unreproducible experimental results at some point.
However, I still doubt the applicability of the proposed protocol to ALL DAOD works. For methods that do not use mean-teacher-like models, does comparing against a mean-teacher-based source-only and oracle still lead to fair comparisons? Take SADA as an example: since SADA does not use a mean-teacher or teacher-student framework, is it still fair and reasonable to compare it against a mean-teacher-based source-only or oracle, given that mean teacher alone can introduce large performance improvements in many tasks?
Since many recent DAOD works do not use MT (and future works may not either), I don't think the proposed protocol is applicable to benchmarking ALL DAOD works as declared. If one wants to include non-MT-based methods in the comparison, it would be more appropriate to use a conventional source-only and oracle, as comparing non-MT-based models against an MT-based source-only and oracle will underestimate their adaptation performance.