tanganke / weight-ensembling_MoE

Code for paper "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts"

Question about WEMoE vs PEFT #2

Open Zhou-Hangyu opened 2 weeks ago

Zhou-Hangyu commented 2 weeks ago

Thanks for your awesome work on model merging! I'm excited about the improvements you achieved compared to other merging methods. However, I saw that the individually fine-tuned models still outperform WEMoE across all datasets. This makes me wonder what's stopping us from directly using these task-specific models for downstream tasks.

The two reasons I can think of are: 1) you used traditional fine-tuning methods, so combining these huge models is beneficial for efficiency; 2) it's nontrivial to decide which model to use.

But both of these seem solvable with relatively straightforward solutions:

  1. We can use PEFT methods like LoRA or prompt tuning to save space.
  2. We can use some test data to decide which PEFT to use.

What am I missing here? Could you please enlighten me? Thanks!

tanganke commented 2 weeks ago

Thank you for your attention and thoughtful comment.

In our study, the objective is to transfer task-specific knowledge from individual task-specific models into the pre-trained model, which possesses general knowledge, in order to create a unified multi-task model. This matters because managing a large number of task-specific models can be expensive; combining them into a single unified model reduces redundancy and optimizes resource utilization.

This work focuses on vision transformers. PEFT techniques are indeed promising for saving space and reducing the resource requirements of individual models, but PEFT methods such as LoRA have been found to be less effective than full fine-tuning in the vision domain. For language models, however, PEFT methods can be competitive with full fine-tuning, and there are papers that merge language models fine-tuned with PEFT methods, such as AdapterSoup and LoraHub.

On the other hand, the experimental setup for multi-task model fusion assumes that either no training data is accessible, or only a limited number of samples from the training datasets is available, which is the common situation when you download models from HuggingFace. So in practice, it is not always possible to decide on the fine-tuning approach used for the pre-trained model with regard to downstream tasks.

Using some test data to decide which weight patches (i.e., the modules that contain the task-specific knowledge) to use is exactly the idea behind our method. We fine-tune the routers of our proposed weight-ensembling MoE during the test phase using a test-time adaptation method, so that the model dynamically integrates the most appropriate task-specific knowledge patches. Our method embodies a form of dictionary learning, where the dictionary consists of the different task vectors and the routing weights are decoded into model weights.
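
To make the dictionary-learning view concrete, here is a minimal PyTorch sketch of a weight-ensembling MoE linear layer. It is not the repository's exact implementation, and the class and variable names are illustrative: the task vectors form a fixed dictionary, a small router produces routing weights from the input, and those routing weights decode into a per-sample update of the pre-trained weight. Only the router would be trained during test-time adaptation.

```python
# Minimal sketch of a weight-ensembling MoE linear layer (illustrative only,
# assuming the wrapped nn.Linear layers have a bias). The actual repository
# may organize and optimize this differently.
import torch
import torch.nn as nn


class WeightEnsemblingMoELinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, finetuned: list[nn.Linear], hidden_dim: int = 32):
        super().__init__()
        # Frozen pre-trained weight and bias (general knowledge).
        self.register_buffer("w0", pretrained.weight.detach().clone())
        self.register_buffer("b0", pretrained.bias.detach().clone())
        # Task vectors: fine-tuned weights minus the pre-trained weight.
        task_vectors = torch.stack([ft.weight.detach() - self.w0 for ft in finetuned])
        self.register_buffer("task_vectors", task_vectors)  # (num_tasks, out, in)
        # Lightweight router: input features -> one routing weight per task vector.
        # This is the only trainable part during test-time adaptation.
        self.router = nn.Sequential(
            nn.Linear(pretrained.in_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, len(finetuned)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Routing weights from the (token-averaged) input, one set per sample.
        gate = self.router(x.mean(dim=1) if x.dim() == 3 else x)  # (batch, num_tasks)
        # Decode routing weights into a weight update: sum_i r_i * tau_i.
        delta = torch.einsum("bt,toi->boi", gate, self.task_vectors)
        w = self.w0.unsqueeze(0) + delta  # per-sample merged weight
        return torch.einsum("boi,b...i->b...o", w, x) + self.b0
```

In this sketch, one could wrap the MLP layers of a vision transformer with such a module, freeze everything except `router`, and minimize an unsupervised objective (e.g., prediction entropy) on unlabeled test batches; the per-sample weight materialization here is for clarity rather than efficiency.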