Open ChaofanTao opened 4 weeks ago
Based on Fig Da in the model stock paper, I would guess it is the max of the pairwise angles between all pairs within the three, so max(theta_w12, theta_w13, theta_w23). But perhaps @hellbell can help to confirm this?
In the mergekit implementation, they take the average of all pairwise angles. However, I am still not sure whether averaging is the right aggregation to use rather than the max.
Additional discussion here: https://github.com/arcee-ai/mergekit/issues/453
@ChaofanTao @vishaal27 Hi, when $N \gt 2$, we compute $\cos{\theta}$ by averaging all pairwise cosine values $\cos{\theta_{ij}}$ (where $i \neq j$). I.e., $\cos{\theta} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \cos{\theta_{ij}}$ in Eqs. (10)-(12). So mergekit's implementation is right. We will release our code soon.
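As a rough illustration of the averaging described above, here is a minimal Python sketch. It is not the authors' released code or mergekit's implementation. It assumes each fine-tuned model is represented by a flattened difference vector $w_i - w_0$ (plain lists of floats here), and the names `cosine`, `mean_pairwise_cosine`, and `deltas` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two flat vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cos(theta_ij) over all unordered pairs i < j.

    Because cos(theta_ij) == cos(theta_ji), this equals the average
    over all ordered pairs with i != j.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy difference vectors (assumed to be w_i - w_0 for N = 3 models).
deltas = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
avg_cos = mean_pairwise_cosine(deltas)
print(avg_cos)
```

In practice this would presumably be applied per layer on flattened weight tensors; that detail is an assumption and is not stated in this thread.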
Thanks for your reply!
Awesome, thanks so much for clarifying! However, I am still a bit unclear why it should be the average of all pairwise cosine values and not the max. Intuitively, it seems to me that the max would give a more accurate merged vector. Could you share some thoughts or intuition on this, @hellbell? That would be very helpful! :)
@vishaal27 Thank you for the follow-up question :)
We did not use intuition to derive Eq (10); it was derived from Eq (8), Eq (9), and Appendix B. In our setting, we obtain different model weights $w_i$ and $w_j$ by altering random seeds while fixing the hyper-parameters. So, I thought each $\cos{\theta_{ij}}$ might be very similar, and we did not carefully mention it in the paper.
To discuss further, could you give me your intuition for why $\max_{ij}(\cos{\theta_{ij}})$ would work better?
I have done extensive experimentation with task vectors, and taking the max is possibly the worst option - it often completely destabilizes the model.
Thanks for your work!
I am confused about Eq. (3) in your paper and how theta is computed. Assume we have the weights of the base model (w0) and of 3 fine-tuned models (w1, w2, w3), and we form the vectors:
w01 = w1 - w0, w02 = w2 - w0, w03 = w3 - w0.
How should theta be computed from these? In other words, I am not sure what 'the angle θ between the pretrained model and the N fine-tuned model' refers to in the paper.
Thanks for your time.
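To make the setup in this question concrete, the sketch below builds the three difference vectors from toy weights and prints each pairwise angle, along with the two aggregations discussed in this thread (averaging the pairwise cosines, and taking the largest pairwise angle). The weight values and all function and variable names are made up for illustration; this is a sketch under those assumptions, not the paper's code.

```python
import math
from itertools import combinations


def task_vector(w, w0):
    """Difference between fine-tuned weights w and base weights w0."""
    return [a - b for a, b in zip(w, w0)]


def cosine(u, v):
    """Cosine of the angle between two flat vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy weights (hypothetical values): base model w0 and three fine-tuned models.
w0 = [1.0, 1.0, 1.0]
finetuned = {1: [2.0, 1.0, 1.0], 2: [1.0, 2.0, 1.0], 3: [2.0, 2.0, 1.0]}

# w01, w02, w03 as defined in the question.
deltas = {i: task_vector(w, w0) for i, w in finetuned.items()}

# cos(theta_ij) for every unordered pair (i, j) with i < j.
pairwise_cos = {
    (i, j): cosine(deltas[i], deltas[j]) for i, j in combinations(sorted(deltas), 2)
}
pairwise_deg = {k: math.degrees(math.acos(c)) for k, c in pairwise_cos.items()}

# Aggregation 1: average the pairwise cosines, then convert to an angle.
mean_cos = sum(pairwise_cos.values()) / len(pairwise_cos)
theta_from_mean_deg = math.degrees(math.acos(mean_cos))

# Aggregation 2: the largest pairwise angle (the "max" alternative).
theta_max_deg = max(pairwise_deg.values())

print(pairwise_deg)
print(theta_from_mean_deg, theta_max_deg)
```

With these toy values the pairwise angles differ a lot, so the two aggregations disagree; when the fine-tuned models differ only by random seed, the pairwise cosines may be close to each other and the choice matters less.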