FusionBench was later work, after I had gained more experience with fine-tuning. Looking back now, I realize that the hyperparameters I set for that work weren't optimal; in particular, the weight decay values I used when fine-tuning those models were likely too aggressive.
Thank you very much for your timely response. Another small question: when calculating the cosine similarity for LoRA, are you computing the similarity between the merged updates W = BA? When I calculated it that way on ViT-B-16, the similarity was comparable to, or even lower than, the full fine-tuning results.
I computed the similarity using flattened and concatenated matrices: B and A are first flattened into vectors, and these vectors are then concatenated into a single vector. This approach captures the raw parameter changes without accounting for their multiplicative interaction.
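For concreteness, here is a minimal sketch of the two measurements being compared in this thread (PyTorch; the function names and the ViT-B/16-style shapes are my own illustration, not code from FusionBench). `lora_cosine_flat` is the flatten-and-concatenate approach described above; `lora_cosine_product` is the similarity on the merged update W = BA:

```python
import torch
import torch.nn.functional as F

def lora_cosine_flat(B1, A1, B2, A2):
    # Flatten B and A into vectors and concatenate them, then compare.
    # This ignores the multiplicative interaction W = B @ A.
    v1 = torch.cat([B1.flatten(), A1.flatten()])
    v2 = torch.cat([B2.flatten(), A2.flatten()])
    return F.cosine_similarity(v1, v2, dim=0)

def lora_cosine_product(B1, A1, B2, A2):
    # Compare the merged low-rank updates W = B @ A directly.
    w1 = (B1 @ A1).flatten()
    w2 = (B2 @ A2).flatten()
    return F.cosine_similarity(w1, w2, dim=0)

# Hypothetical shapes for one ViT-B/16 projection (d=768, rank=16).
B1, A1 = torch.randn(768, 16), torch.randn(16, 768)
B2, A2 = torch.randn(768, 16), torch.randn(16, 768)
print(lora_cosine_flat(B1, A1, B2, A2))
print(lora_cosine_product(B1, A1, B2, A2))
```

The two quantities can differ substantially on real checkpoints, which is consistent with the discrepancy reported above.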
This paper discusses the difference between these two methods: LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks.
If you take differences in the product W, the fine-tuned model is already effectively partially linearized. But expressed in terms of B and A, the optimization problem is quadratic in the parameters, which is why linearization works there.
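To make the quadratic interaction explicit (the notation $B_0, A_0$ for the pre-trained factors and $\Delta B, \Delta A$ for the fine-tuning updates is mine):

$$
(B_0 + \Delta B)(A_0 + \Delta A) = B_0 A_0 + B_0\,\Delta A + \Delta B\,A_0 + \Delta B\,\Delta A
$$

The cross term $\Delta B\,\Delta A$ means the map from $(\Delta B, \Delta A)$ to the weight-space update $\Delta W$ is bilinear rather than linear, so similarities measured in $(B, A)$ space and in $W$ space need not agree.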
Thank you very much for your response. I understand now; this has been very helpful to me.
Fine-Tuning Linear Layers Only Is a Simple yet Effective Way for Task Arithmetic
So you can simply fully fine-tune the linear layers in the attention modules instead of using linearized LoRA adapters, for example as sketched below.
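A minimal sketch of that recipe, assuming a timm-style ViT where attention projections live under modules named `*.attn.*` (the naming convention and helper function are my assumptions; adjust the filter to your model):

```python
import torch
import torch.nn as nn

def freeze_all_but_attn_linear(model: nn.Module) -> nn.Module:
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only nn.Linear layers inside attention modules,
    # identified here by ".attn" in the module path (timm-style names
    # such as "blocks.0.attn.qkv" and "blocks.0.attn.proj").
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and ".attn" in name:
            for p in module.parameters():
                p.requires_grad = True
    return model

# Usage: pass only the trainable parameters to the optimizer.
# model = freeze_all_but_attn_linear(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```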
Hello, you have done a great piece of work. However, I have a question regarding Figure 2 in the original paper: why is there such a large difference between the task-vector similarities in Figure 2(a) and those reported in FusionBench below? (My results are closer to FusionBench's.)
Is there something I might have overlooked?