naver-ai / model-stock

Model Stock: All we need is just a few fine-tuned models

how to compute the theta in the case of multiple (N>=3) finetuned models. #2

Open ChaofanTao opened 4 weeks ago

ChaofanTao commented 4 weeks ago

Thanks for your work!

I am confused about Eq. 3 in your paper, which concerns the computation of theta. Assume we have the weights of the base model (w0) and three fine-tuned models (w1, w2, w3), and form the vectors:

w01 = w1 - w0, w02 = w2 - w0, w03 = w3 - w0.

How should theta be computed in this case? Put differently, I am not sure what 'the angle θ between the pretrained model and the N fine-tuned model' refers to in the paper.

Thanks for your time.

vishaal27 commented 3 weeks ago

Based on Fig Da in the model stock paper, I would guess it is the max of the pairwise angles between all pairs within the three, so max(theta_w12, theta_w13, theta_w23). But perhaps @hellbell can help to confirm this?

vishaal27 commented 3 weeks ago

In the mergekit implementation, they use the average of all pairwise angles. However, I am still not sure whether this is the right aggregation to use, as opposed to the max.

Additional discussion here: https://github.com/arcee-ai/mergekit/issues/453

hellbell commented 3 weeks ago

@ChaofanTao @vishaal27 Hi, when $N \gt 2$, we compute $\cos{θ}$ by averaging all pairwise cosine values $\cos{θ_{ij}}$ (where $i \neq j$), i.e., $\cos{θ}=\frac{1}{N(N-1)}\sum_{i \neq j}{\cos{θ_{ij}}}$ with $i, j \in \{1, \dots, N\}$, in Eq. (10)-(12). So mergekit's implementation is right. We will release our code soon.
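For readers who want to try the averaging described above, here is a minimal sketch in plain Python. It is not the authors' released code: the helper names (`cosine`, `mean_pairwise_cosine`) and the toy weight vectors are assumptions made for illustration, and the weights are treated as flat lists of floats. Because $\cos{θ_{ij}} = \cos{θ_{ji}}$, averaging over unordered pairs gives the same value as averaging over all ordered pairs with $i \neq j$.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two flat vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(w0, finetuned):
    """Average cos(theta_ij) over all pairs of task vectors w_i - w0.

    Hypothetical helper: w0 is the pretrained weight vector and
    finetuned is a list of N >= 2 fine-tuned weight vectors.
    """
    # Task vectors: offsets of each fine-tuned model from the base model.
    deltas = [[wi - bi for wi, bi in zip(w, w0)] for w in finetuned]
    # Unordered pairs suffice because cosine similarity is symmetric.
    cosines = [cosine(u, v) for u, v in combinations(deltas, 2)]
    return sum(cosines) / len(cosines)


# Toy example (made-up numbers, not real model weights).
w0 = [0.0, 0.0, 0.0]
w1 = [1.0, 0.0, 0.0]
w2 = [0.0, 1.0, 0.0]
w3 = [1.0, 1.0, 0.0]

cos_theta = mean_pairwise_cosine(w0, [w1, w2, w3])
print(cos_theta)
```

In this toy case the three pairwise cosines are 0, 1/√2 and 1/√2, so the average is √2/3.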

ChaofanTao commented 3 weeks ago

Thanks for your reply!

vishaal27 commented 3 weeks ago

Awesome, thanks so much for clarifying! However, I am still a bit unclear on why it should be the average of all pairwise cosine values and not the max. Intuitively, it seems to me that the max would provide a more accurate merged vector. Could you please share some thoughts or intuition on this, @hellbell? That would be very helpful! :)

hellbell commented 3 weeks ago

@vishaal27 Thank you for the follow-up question :)

We did not use intuition to derive Eq. (10); it was derived from Eq. (8), Eq. (9), and Appendix B. In our setting, we obtain different model weights $w_i$ and $w_j$ by varying random seeds while fixing the hyper-parameters. So I expected each $\cos{θ_{ij}}$ to be very similar, and we did not discuss this carefully in the paper.

To discuss more, could you give me your intuition of why $\max_{ij}(\cos{θ_{ij}})$ would work better?
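To make the candidate aggregations in this thread concrete, here is a small sketch that computes the mean, max, and min of the pairwise cosines for a set of toy task vectors. The vectors and helper names are assumptions for illustration only, and the sketch makes no claim about which aggregation merges better. One thing it does show: since cosine decreases as the angle grows on [0°, 180°], the largest pairwise angle corresponds to the smallest pairwise cosine, not the largest.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two flat vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy task vectors w_i - w0 (made-up numbers, not real model weights).
deltas = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pair_cosines = [cosine(u, v) for u, v in combinations(deltas, 2)]

mean_cos = sum(pair_cosines) / len(pair_cosines)  # averaging, as described by the authors
max_cos = max(pair_cosines)                       # most-aligned pair
min_cos = min(pair_cosines)                       # least-aligned pair

# Clamp before acos to guard against tiny floating-point overshoot.
clamped = max(-1.0, min(1.0, min_cos))
max_angle_deg = math.degrees(math.acos(clamped))  # largest pairwise angle

print(mean_cos, max_cos, min_cos, max_angle_deg)
```

When the pairwise cosines are nearly equal, as expected when only the random seed varies, the three aggregations nearly coincide; they diverge only when the fine-tuned models spread unevenly, as in this deliberately lopsided toy example.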

maldevide commented 2 weeks ago

> Based on Fig Da in the model stock paper, I would guess it is the max of the pairwise angles between all pairs within the three, so max(theta_w12, theta_w13, theta_w23). But perhaps @hellbell can help to confirm this?

I have done extensive experimentation with task vectors, and taking the max is possibly the worst option - it often completely destabilizes the model.