follow up paper - FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features #6
7.2. Training Details
We train on three NVIDIA A100 (80GB) GPUs for about
23 days. We found that warming up (i.e., Phase I training,
explained in Sec. 3.3) is essential to avoid converging to poor local minima. The batch size must also be large enough: in
our experiments, a batch size of 24 was sufficient, whereas with
a batch size of eight, training progressed slowly, appeared unstable, and ended in
a local minimum with poor inference performance. When
adding adversarial losses in training Phase III, we let the
discriminator warm up for 500 iterations without propagating its gradients to the model. This is essential because an untrained discriminator would otherwise disrupt training with large-magnitude gradients.
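The discriminator warm-up described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: the model, discriminator, losses, and optimizer settings are all placeholder assumptions; only the idea of withholding adversarial gradients from the model for the first 500 iterations comes from the text.

```python
import torch
import torch.nn as nn

# Stand-ins for the face-reenactment model and the discriminator
# (hypothetical architectures, for illustration only).
model = nn.Linear(8, 8)
disc = nn.Linear(8, 1)
opt_m = torch.optim.Adam(model.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

DISC_WARMUP = 500  # iterations during which only the discriminator trains

def training_step(step, real, src):
    """One Phase-III step; returns True if the model received
    adversarial gradients (i.e. the warm-up is over)."""
    fake = model(src)

    # The discriminator always trains; fake.detach() keeps its
    # gradients from flowing back into the model.
    opt_d.zero_grad()
    d_loss = (bce(disc(real), torch.ones(real.size(0), 1)) +
              bce(disc(fake.detach()), torch.zeros(real.size(0), 1)))
    d_loss.backward()
    opt_d.step()

    # The model keeps training on its (stand-in) reconstruction loss,
    # but the adversarial term is added only after the warm-up.
    opt_m.zero_grad()
    m_loss = nn.functional.l1_loss(fake, real)
    past_warmup = step >= DISC_WARMUP
    if past_warmup:
        m_loss = m_loss + bce(disc(fake), torch.ones(real.size(0), 1))
    m_loss.backward()
    opt_m.step()
    return past_warmup
```

During the first 500 iterations only `opt_d` sees adversarial gradients, so the discriminator can calibrate before it starts steering the model.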
https://arxiv.org/pdf/2404.09736
https://github.com/johndpope/VASA-1-hack/issues/5