justinkay / aldi

Official implementation of "Align and Distill: Unifying and Improving Domain Adaptive Object Detection"
https://aldi-daod.github.io/

How to understand "burn-in" #17

Closed HDUyiming closed 5 months ago

HDUyiming commented 5 months ago

First of all, thank you very much for your outstanding contributions to the DAOD community! I will conduct my research on your ALDI codebase and hope we can advance the DAOD community together! ❀❀❀

Additionally, I have some questions I'd like to ask you:

P1. How should I understand the burn-in entries "Fixed" and "-" in Table 1? I have reproduced MIC and AT.

● In MIC, the total number of iterations is 60k, and the mutual-learning stage starts once pseudo labels are generated (i.e., the teacher model produces detections with confidence greater than 0.8). This usually happens after about 140 iterations, so is the burn-in stage effectively 140 iterations?
● In AT, there is an explicit "burn-in" concept: the burn-in stage is 20k iterations and the total is 100k iterations.

P1.1 So how should I understand "Fixed"?
P1.2 I noticed that in your code these two stages are trained separately, but I can't see the difference from AT (AT does not need to be trained in two stages).
P1.3 I also don't understand "keep an EMA copy of the model during burn-in". Can you explain the role of the EMA copy during burn-in? (A generic sketch of what I understand by an EMA copy is included below.)
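For reference, here is a minimal, generic sketch of what I understand "keep an EMA copy of the model" to mean in a mean-teacher setup; the helper names and alpha=0.999 are my own illustration, not your code:

```python
import copy
import torch

def make_ema_copy(student: torch.nn.Module) -> torch.nn.Module:
    """Create a detached copy of the student that is only updated via EMA."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.999):
    """One EMA step: teacher <- alpha * teacher + (1 - alpha) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)
```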

P2. The MIN_SIZE_TRAIN hyperparameter

I know this hyperparameter has a very large effect on model performance, but it is not analyzed in the paper. In AT, the value is MIN_SIZE_TRAIN: (600,); in MIC it is 800; while in ALDI it is MIN_SIZE_TRAIN: (800, 832, 864, 896, 928, 960, 992, 1024). I'm not sure how this parameter works in your codebase. Can you explain it?


I am very much looking forward to your reply! Have a great day! ☀

justinkay commented 5 months ago

Hi @HDUyiming, thanks for your questions! We plan to add some extra details about this to the supplemental material.

P1. For PT and AT, we say "Fixed" in Table 1 to indicate that they used a fixed number of iterations for burn-in in the original implementations. For instance, in AT this is a config value: https://github.com/facebookresearch/adaptive_teacher/blob/5256463ad9ec90fd5ba84ebb8d53bed56bd369df/configs/faster_rcnn_VGG_cross_city.yaml#L45

However, we did not feel it was fair to simply reuse the number of iterations from the original implementations, since it is unlikely the training curves would look similar. So for PT and AT in ALDI, we use the best possible burn-in checkpoint, i.e., we use early stopping on target-domain validation to pick the burn-in stopping point. This most likely improves both PT and AT in our codebase compared to the original implementations, unless their fixed iteration count was already heavily tuned.
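A minimal sketch of that selection procedure, assuming checkpoints are saved periodically during burn-in (evaluate_target_map is a placeholder for COCO-style evaluation on the target validation set, not a function in our codebase):

```python
import glob

def pick_burn_in_checkpoint(ckpt_dir, evaluate_target_map):
    """Return the burn-in checkpoint with the best target-domain validation mAP."""
    best_path, best_map = None, float("-inf")
    for path in sorted(glob.glob(f"{ckpt_dir}/*.pth")):
        val_map = evaluate_target_map(path)   # mAP on the target validation set
        if val_map > best_map:
            best_path, best_map = path, val_map
    return best_path, best_map
```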

In MIC there is no explicit burn-in stage, so yes, as you say, self-training essentially starts once pseudo-labels reach high confidence. We consider this "no burn-in".
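For intuition, here is a minimal sketch of that confidence gating; the 0.8 threshold and the flat (boxes, scores, classes) format are illustrative, not the MIC or ALDI implementation:

```python
def filter_pseudo_labels(boxes, scores, classes, threshold=0.8):
    """Keep only teacher detections whose confidence exceeds the threshold."""
    keep = [i for i, score in enumerate(scores) if score >= threshold]
    return ([boxes[i] for i in keep],
            [scores[i] for i in keep],
            [classes[i] for i in keep])
```

Until the teacher produces any detection above the threshold, the filtered lists are empty and there is no self-training signal, which is why mutual learning effectively begins only after the first ~140 iterations in the scenario you describe.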

P2. MIN_SIZE_TRAIN: (800, 832, 864, 896, 928, 960, 992, 1024) is what we call multi-scale augmentation in the paper, and it is used for all methods in our paper for fair comparison (SADA, PT, AT, MIC, UMT, and ALDI++). We were recently informed by an observant peer reviewer that we forgot to indicate this in Table 1; this was an omission we will fix. As you can see in the config files, all methods use multi-scale augmentation for fair comparison.

We will consider adding an ablation for multi-scale to the supplemental as well, as it does improve performance, but note that all methods benefit from it in our framework.
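Regarding how MIN_SIZE_TRAIN works mechanically: in a Detectron2-style data loader with "choice" sampling, one shortest-edge length is picked at random per image from the tuple. A minimal sketch using the standard Detectron2 transform (illustrative, not our exact data mapper; max_size=1333 is Detectron2's default MAX_SIZE_TRAIN):

```python
import detectron2.data.transforms as T

# One shortest-edge length is sampled per image from the tuple; the longest
# edge is capped at max_size while preserving aspect ratio.
multi_scale_resize = T.ResizeShortestEdge(
    short_edge_length=(800, 832, 864, 896, 928, 960, 992, 1024),
    max_size=1333,
    sample_style="choice",
)
```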

HDUyiming commented 5 months ago

Thank you very much!