mmatena opened this issue 3 years ago
The low-dimensional space might still lead to issues with non-convexity, though. However, other optimization methods, such as evolutionary methods, might be well suited when merging many models together at once.
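As a rough illustration of what a derivative-free approach could look like (my own sketch, not code from this project), here is a minimal (1+1)-style hill climb over the mixing logits; a proper evolutionary method such as CMA-ES would replace the simple loop. It assumes the merge is a convex combination of checkpoints, and `evaluate` is a hypothetical user-supplied callback that scores a merged state dict on held-out data.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def merge(state_dicts, weights):
    """Weighted average of the N checkpoints, one coefficient per model."""
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}

def search_mixing_weights(state_dicts, evaluate, iters=100, sigma=0.1, seed=0):
    """(1+1)-style hill climb over mixing logits; `evaluate` returns a
    validation score to maximize for a given merged state dict."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(state_dicts))
    best = evaluate(merge(state_dicts, softmax(logits)))
    for _ in range(iters):
        candidate = logits + sigma * rng.standard_normal(len(state_dicts))
        score = evaluate(merge(state_dicts, softmax(candidate)))
        if score > best:
            logits, best = candidate, score
    return softmax(logits), best
```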
It may also be of interest to see if the learned "mixing proportions" transfer to good mixing proportions for traditional multi-task fine-tuning. This reminds me a lot of task2vec.
I'm thinking of this in the context of boosting performance on a single task. Efficiency gains will be largest when doing a multi-way merge across many tasks at once.
The idea is that you freeze everything except the merge weights and train them via gradient descent. Memory might become an issue for larger models, though, as you'll need to put 2 * N copies of the weights on the GPU for an N-way merge.
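For concreteness, here is a minimal PyTorch sketch of that setup (an illustration under my own assumptions, not code from this project): all N checkpoints are frozen, a vector of mixing logits is the only trainable tensor, and each step rebuilds the merged parameters for a functional forward pass. `models` and `task_loader` are placeholders for however the checkpoints and task data are loaded.

```python
import torch
import torch.nn.functional as F

def train_merge_weights(models, task_loader, steps=200, lr=1e-2):
    # Freeze every model parameter; only the mixing logits get gradients.
    for m in models:
        for p in m.parameters():
            p.requires_grad_(False)
    logits = torch.zeros(len(models), requires_grad=True)  # move to the models' device for GPU training
    opt = torch.optim.Adam([logits], lr=lr)

    params = [list(m.parameters()) for m in models]
    base = models[0]  # reused as the "skeleton" module for the merged forward pass
    names = [n for n, _ in base.named_parameters()]

    for _, (x, y) in zip(range(steps), task_loader):
        alphas = torch.softmax(logits, dim=0)
        # Merged parameters: a convex combination of the frozen checkpoints.
        merged = {name: sum(a * p for a, p in zip(alphas, ps))
                  for name, ps in zip(names, zip(*params))}
        out = torch.func.functional_call(base, merged, (x,))
        loss = F.cross_entropy(out, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()
```

Note that this keeps the N frozen copies of the parameters plus a freshly built merged copy in memory each step, which is where the memory concern for larger models comes from.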
Try to see how much data you need. There's a good chance, at least for merging pairs of models, that you'll need very little, since the overfitting potential is quite low (only the handful of merge weights is trainable).