Hello, I've tried to implement the image generation loss using feature whitening, and I'm wondering if you have tried this with low batch sizes? It seems the whitening can be very inaccurate at small batch sizes. At the moment I'm simulating a larger batch by pooling 16 batches of features before calculating the loss. Please let me know if you encountered this issue and have any suggestions.
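For reference, a minimal sketch of the pooling workaround I'm using, assuming 2-D feature tensors of shape (batch, dim). The eigendecomposition-based whitening, the `eps` value, and `feature_batches` are my own placeholders, not from this repo:

```python
import torch

def whitened_mse(student_feats, teacher_feats, eps=1e-5):
    # Whiten using the teacher covariance estimated from the pooled batch,
    # then take the MSE in the whitened space. eps stabilizes the inverse
    # square root when the covariance is poorly conditioned.
    t = teacher_feats - teacher_feats.mean(dim=0, keepdim=True)
    cov = t.T @ t / (t.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag((eigvals.clamp_min(0.0) + eps).rsqrt()) @ eigvecs.T
    return ((student_feats - teacher_feats) @ w).pow(2).mean()

# Simulate a larger batch: pool 16 micro-batches of features before
# computing the loss (this keeps all 16 graphs alive, so it is memory-heavy).
student_pool, teacher_pool = [], []
for _ in range(16):
    s, t = next(feature_batches)  # placeholder iterator of (student, teacher) features
    student_pool.append(s)
    teacher_pool.append(t)
loss = whitened_mse(torch.cat(student_pool), torch.cat(teacher_pool))
```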
Hi, sorry for the late reply. Unfortunately we didn't try any batch size below 50 for the image generation experiments. Batch size generally has a big effect on distillation performance, and it is also difficult to apply gradient-accumulation-style tricks when using feature whitening.
The issue I see is that the covariance matrix may be close to singular, which would cause problems when computing its inverse (via SVD). You may be able to accumulate many teacher features before pre-computing the covariance matrix (which is then used to compute the whitening matrix), though I have not tried this.
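Very roughly, something like the following. This is just an untested sketch; the shrinkage regularizer and all names here are illustrative rather than from our code:

```python
import torch

@torch.no_grad()
def precompute_whitening_matrix(teacher, loader, num_batches=200,
                                shrinkage=0.1, eps=1e-5):
    # Accumulate teacher features over many batches first; no gradients
    # are needed here, so this only costs memory for the features themselves.
    feats = []
    for i, x in enumerate(loader):
        if i >= num_batches:
            break
        feats.append(teacher(x))
    f = torch.cat(feats)
    f = f - f.mean(dim=0, keepdim=True)
    cov = f.T @ f / (f.shape[0] - 1)
    # Shrink toward a scaled identity so the covariance stays invertible
    # even if it is still near-singular after accumulation.
    scale = cov.diagonal().mean()
    cov = (1.0 - shrinkage) * cov + shrinkage * scale * torch.eye(
        cov.shape[0], device=cov.device, dtype=cov.dtype)
    # Inverse square root via eigendecomposition (equivalent to SVD for SPD matrices).
    eigvals, eigvecs = torch.linalg.eigh(cov)
    return eigvecs @ torch.diag((eigvals.clamp_min(0.0) + eps).rsqrt()) @ eigvecs.T
```

The returned matrix can then be reused in the per-batch loss for both student and teacher features, so the small training batch no longer has to estimate its own covariance.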
Thanks for the reply! I think I can see where the issue is coming from. Even with feature accumulation I'm fairly limited when training on 1k-resolution images. For now I've started training on smaller images; I'll try to think of a solution for larger images if the results are positive. Thanks again!
ktadgh closed this 2 months ago.