Some suggestions for anyone who use SID on other generative tasks or datasets

mingyuanzhou / SiD

PyTorch code and model checkpoints for Score identity Distillation (SiD) published in ICML 2024

Apache License 2.0

76 stars 5 forks source link

First of all, thanks to the authors for their excellent contribution to the diffusion modeling community, sid is definitely the coolest distillation method I've ever tried (and more interesting compared to discrete/continuous coherent distillation) I tried sid on a speech generation task, and at first I completely followed the hyperparameters of the best fid in the paper, but then I struggled to get meaningful results other than noise. After some tweaking of the hyperparameters, I got normal results and performed above my expectations (just one step!) . What follows are some of my experiences:

1, try more alpha values like 0.5, 0.75, 1.0, for some of you who don't know the background, you can and imagine the method in the paper as similar to cfg (classifier-free-guidance), loss_2 is similar to the conditional output in cfg, which has a positive meaning, and loss_1 is similar to unconditional output in cfg, which has a negative meaning, so just like the coefficients in cfg, we need to adjust the proportion of the negative that comes from loss_1. Theoretically, loss_1 is the component used to adjust for the gradient bias in loss_2 due to the inability of stop_grad to be derived. alpha=0.5, alpha=1, was shown to theoretically remove the gradient bias completely (https://arxiv.org/abs/2410.19310, https:// arxiv.org/abs/2410.16794), so it can be tried, while 0.75 and 1.25 are excessive elimination of bias (just like excessive elimination of unconditional components in cfg), which may not lead to perfect gradient adjustment, but it can be cautiously tried. 2, try different learning rates, especially smaller learning rates like 1e-6. I have no theoretical proof about learning rates, but my experience says that we can't set larger learning rates to make the model deviate quickly from the original “true score” (teacher's model/pre-trained model) weights, note that sid is a data-free method, so we need to be very careful about how we use it. Note that sid is a data-free method, so we need to rely heavily on the weights of the pre-trained model for one-step generation. If the model moves away from the initial weights at a very fast rate at the beginning of the training, the training may fail because the modeler is not able to get back on the right track. 3, the weights of the loss function. For the same reason as above, we need a reasonable loss function weight for updating, the default weight used in the paper is 100, if you find that this makes your loss too big, you can make it normal by adjusting the alpha, adjusting the learning rate, or adjusting the weight of the loss function. 4, keep patience, my experience shows that sid is a proven one-step distillation method, if you do not get normal results for a while, please keep patience and try more, it is theoretically no problem.

Thank you very much for generously sharing your experience with SiD, and I’m delighted to hear that SiD has exceeded your expectations on your speech generation task! Below are some thoughts and recommendations:

Setting Alpha
I recommend setting $\alpha = 1$ as the default and, if resources allow, tuning it later to explore potential performance improvements. I view $\alpha$ as a gradient-bias correction factor. The two recent papers you mentioned provide alternative perspectives on deriving the SiD loss as used in practice, but it's important to emphasize that regardless of the derivation, we cannot entirely eliminate the bias introduced by replacing the perfect fake score network with an estimated one. I've had extensive discussions on these points with Jianlin, who recently wrote an excellent blog post related to these discussions: https://kexue.fm/archives/10567.

Additionally, we have introduced a new method, SiDA, which will be integrated into this repository. Unlike SiD, SiDA is not data-free and leverages training data to address the gap between the pretrained teacher and the theoretically optimal teacher, resulting in significantly improved performance. We also observe that SiDA is less sensitive to the specific choice of $\alpha$, provided $\alpha > 0$. Again, we recommend $\alpha =1$ by default.
Learning Rate
The learning rate is indeed important, but it can typically be identified within the range of 1e-6 to 1e-4. For instance:
- To distill Stable Diffusion 1.5 and 2.1-base using SiD-LSG, we used 1e-6.
- To distill EDM2 on ImageNet 512×512, we used 5e-5.

Loss Scaling
SiD outputs the loss during iterations. Based on my experience, adjusting the loss scaling to keep the loss around 100 can help avoid numerical issues under fp16 precision. If there are concerns about the current loss scaling, I recommend starting with fp32 if memory permits. Once things are stable in fp32, switch to fp16 and adjust the loss scaling so the loss stabilizes around 1 to 1000.
Debugging and Progress
If the code is bug-free and the hyperparameter choices are reasonable, SiD should demonstrate exponentially fast progress. If it does not work or progress stalls, it’s likely due to bugs, such as improper gradient cutoff somewhere in the code (SiD relies on backpropagating gradients through both the pretrained teacher score network and the fake-score network). This characteristic not only facilitates efficient debugging but also enables state-of-the-art performance given sufficient computational resources.

I hope these insights prove helpful, and I’m excited to hear about your continued success with SiD!

mingyuanzhou / SiD

Some suggestions for anyone who use SID on other generative tasks or datasets #3