tanganke / peta

Code for paper "Parameter Efficient Multi-task Model Fusion with Partial Linearization"
https://arxiv.org/abs/2310.04742

About Cosine Similarity of Task Vector #4

Closed Zhengsh123 closed 2 days ago

Zhengsh123 commented 2 days ago

Hello, you have done a great piece of work. However, I have a question regarding Figure 2 in the original paper. Why is there such a large difference between the cosine similarity of the task vectors in Figure 2(a) and what is shown in FusionBench, as below? (My results are closer to FusionBench.)

[image: cosine-similarity heatmap of task vectors]

Is there something I might have overlooked?

tanganke commented 2 days ago

FusionBench was later work, done after I had gained more experience with fine-tuning. Looking back now, I realize that the hyperparameters I set for this work weren't optimal; in particular, the weight decay values I used when fine-tuning those models were likely too aggressive.

Zhengsh123 commented 2 days ago

Thank you very much for your timely response. Another small question: when calculating the cosine similarity for LoRA, are you computing the similarity between the merged updates W = BA? When I computed it that way on ViT-B-16, the similarity was comparable to, or even lower than, the full fine-tuning results.

tanganke commented 2 days ago

I computed similarity using flattened and concatenated matrices. The matrices B and A are first flattened into vectors. These flattened vectors are then concatenated into a single vector. This approach captures the raw parameter changes without considering their multiplicative interaction.
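A minimal sketch of both computations, assuming PyTorch tensors for a single LoRA layer (the helper names are mine; for a full model you would concatenate across all LoRA layers):

```python
import torch
import torch.nn.functional as F

def flat_concat_sim(B1, A1, B2, A2):
    # Flatten B and A separately, then concatenate into one long vector.
    # This compares the raw parameter changes, ignoring the product BA.
    v1 = torch.cat([B1.flatten(), A1.flatten()])
    v2 = torch.cat([B2.flatten(), A2.flatten()])
    return F.cosine_similarity(v1, v2, dim=0).item()

def merged_sim(B1, A1, B2, A2):
    # Alternative from the question above: compare the merged updates
    # W = B @ A, which does account for the multiplicative interaction.
    w1 = (B1 @ A1).flatten()
    w2 = (B2 @ A2).flatten()
    return F.cosine_similarity(w1, w2, dim=0).item()
```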

This paper discusses the difference between these two methods: *LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks*.

If you measure the differences in terms of W = BA, the model has in effect already been partially linearized and fine-tuned. But in terms of B and A themselves, the update is a quadratic (bilinear) function of the trainable parameters, which is why linearization helps.
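To make that concrete (a sketch in my own notation, not taken from the paper): with a frozen base weight $W_0$ and LoRA update $W = W_0 + BA$, perturbing the trainable factors gives

```math
(B_0 + \Delta B)(A_0 + \Delta A) = B_0 A_0 + \Delta B\, A_0 + B_0\, \Delta A + \Delta B\, \Delta A .
```

The update is linear in $W$ but only bilinear (quadratic) in $(B, A)$: partial linearization keeps the first-order terms $\Delta B\, A_0 + B_0\, \Delta A$ and drops the cross term $\Delta B\, \Delta A$.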

Zhengsh123 commented 2 days ago

Thank you very much for your response. I understand now, this has been very helpful to me.

tanganke commented 2 days ago

*Fine-Tuning Linear Layers Only Is a Simple yet Effective Way for Task Arithmetic*

So you can simply fine-tune the linear layers in the attention modules in full, instead of using linearized LoRA adapters.
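A rough PyTorch sketch of that recipe (the `"attn"` name matching is an assumption based on timm-style ViTs, which name these layers `attn.qkv` / `attn.proj`; for `nn.MultiheadAttention` the fused in-projection is a raw parameter rather than an `nn.Linear`, so adjust the matching to your backbone):

```python
import torch.nn as nn

def freeze_all_but_attention_linears(model: nn.Module) -> None:
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Then re-enable gradients only for nn.Linear layers inside
    # attention blocks; everything else stays frozen.
    for name, module in model.named_modules():
        if "attn" in name and isinstance(module, nn.Linear):
            for p in module.parameters():
                p.requires_grad = True
```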