Is the warm-up a critical part of the full model's performance?

Hi, thanks for sharing the code. As mentioned in README.md, a warm-up model is used to start the 3-stage training process; according to your code, this appears to be a pretraining step with adversarial training. However, this part is not discussed much in the paper. Since the warm-up model is used to initialize the Basemodel in stage 1, and, per the training instructions, each stage relies heavily on the trained model from the previous stage (either to initialize the Basemodel or to initialize a Basemodel_ema), I wonder: if I replace the DA warm-up model with a regular source-only model at startup, will there be a severe chain effect in the downstream stages?

microsoft / ProDA

Is the warm-up a critical part of the full model's performance? #29