Open · lucasgblu opened this issue 1 month ago
Thank you for your attention to our work! We implement zero-init by setting the weight and bias of the final to_out linear layer of the attention module to 0, so most of the original knowledge is retained. This technique is just a trick for smoother training and will not have a big impact on the performance of the model itself.
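For anyone else wondering, a minimal PyTorch sketch of that idea is below. The `ToyAttention` class and the `to_out` attribute name are illustrative assumptions, not the actual code from this repo (in diffusers-style attention blocks `to_out` is an `nn.Sequential` whose first element is the Linear, which the helper also handles).

```python
import torch.nn as nn

def zero_init_to_out(attn_module):
    """Zero-initialize the final output projection of an attention module.

    Assumes the module exposes its output projection as `to_out`, either a
    plain nn.Linear or an nn.Sequential whose first element is the Linear
    (diffusers-style). Adjust the attribute name for your own codebase.
    """
    to_out = attn_module.to_out
    linear = to_out[0] if isinstance(to_out, nn.Sequential) else to_out
    nn.init.zeros_(linear.weight)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

# Hypothetical toy attention block, just to show the helper in use.
class ToyAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

attn = ToyAttention(64)
zero_init_to_out(attn)
# The attention branch now outputs zeros at the start of training, so the
# pretrained behaviour is untouched. to_out still learns, because its weight
# gradient depends on the layer's inputs, not on its (zero) weight values.
print(attn.to_out.weight.abs().sum().item(), attn.to_out.bias.abs().sum().item())  # 0.0 0.0
```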
Thanks for your reply! I thought at first you were doing zero gating like LLaMA-Adapter does.
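For contrast, LLaMA-Adapter-style zero gating would look roughly like the sketch below: a learnable scalar gate initialized to zero scales the new attention branch, instead of zeroing the output projection weights. The class and parameter names here are hypothetical, not code from either project.

```python
import torch
import torch.nn as nn

class ZeroGatedCrossAttention(nn.Module):
    """Zero gating in the spirit of LLaMA-Adapter: a learnable scalar gate,
    initialized to 0, scales the new branch; the projection weights themselves
    are initialized normally. Hypothetical illustration only."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)          # normally initialized
        self.gate = nn.Parameter(torch.zeros(1))   # the zero-init gate

    def forward(self, x, context):
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = self.to_out(attn @ v)
        # At initialization the gate is 0, so the branch adds nothing and the
        # pretrained behaviour is preserved; the gate itself still receives
        # gradients, and once it moves away from zero the branch learns too.
        return x + torch.tanh(self.gate) * out
```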
By the way, the arXiv link for your paper on the homepage mistakenly points to another of your papers @hrz2000
Thanks for your reminder~~ I will correct it now hahaha
Hi! Congrats on this wonderful work. After reading your paper, I'm really curious about one technique that you use.
In the paper, you said:
How do you do the zero-initialization? Is it a zero conv like ControlNet, do you just force the output to be zeros, or do you set the weights of Q, K, or V to zero so that the outcome is zero? Once you do this, will the model still learn to absorb the knowledge from the condition, or does it stay unlearned because zeros provide only minor gradients?
Finally, how good is this technique? Does it greatly or visibly improve the quality?
Congrats again