6 files ±0  6 suites ±0  14m 21s :stopwatch: +4s
12 tests ±0  9 :heavy_check_mark: ±0  3 :zzz: ±0  0 :x: ±0
60 runs ±0  42 :heavy_check_mark: ±0  18 :zzz: ±0  0 :x: ±0
Results for commit 9bacdc47. ± Comparison against base commit 6fb795d5.
@alexsherstinsky Yes, that is exactly right :)
Paper: https://arxiv.org/pdf/2205.05638.pdf
Adds support for a new PEFT strategy called IA3, which adds two learned vectors that rescale the K and V projections in the attention heads, as well as a learned vector that rescales the hidden activations of the feed-forward network. The idea is that these learned vectors can cheaply rescale the attention and feed-forward activations for a downstream task. l_k, l_v, and l_ff are all initialized to ones so that the overall function computed by the model is unchanged when they are first added.
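To make the mechanism concrete, here is a minimal PyTorch sketch of a toy attention-plus-FFN block with IA3-style rescaling. This is illustrative only (module and parameter names are made up, and it is not the implementation added by this PR):

```python
import torch
import torch.nn as nn


class ToyIA3Block(nn.Module):
    """Toy attention + feed-forward block with IA3-style rescaling vectors."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Base (frozen) projections.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.ff_in = nn.Linear(d_model, d_ff)
        self.ff_out = nn.Linear(d_ff, d_model)
        # IA3 vectors: the only trainable parameters, initialized to ones
        # so the block initially computes the same function as the base model.
        self.l_k = nn.Parameter(torch.ones(d_model))
        self.l_v = nn.Parameter(torch.ones(d_model))
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(x)
        k = self.k_proj(x) * self.l_k   # rescale keys
        v = self.v_proj(x) * self.l_v   # rescale values
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5), dim=-1)
        h = attn @ v
        # Rescale the hidden activations of the feed-forward network.
        return self.ff_out(torch.relu(self.ff_in(h)) * self.l_ff)


# Usage: x has shape (batch, sequence, d_model).
block = ToyIA3Block(d_model=16, d_ff=64)
out = block(torch.randn(2, 5, 16))
```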
IA3 makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector (in some ways similar to training a different low-rank decomposition with LoRA for each task). If a model will only be used for a single task, the modifications introduced by IA3 can instead be merged into the weight matrices permanently, so that no element-wise multiplication is required and the model's architecture remains unchanged. This is possible because the element-wise multiplications performed in IA3 always co-occur with a matrix multiplication, so the learned vector can be folded into the adjacent weight matrix and there is no additional computational cost compared to the original model.
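As a quick check of the merging argument, the sketch below (illustrative shapes, not code from this PR) shows that scaling an output element-wise by a learned vector l is equivalent to folding l into the weight matrix, i.e. l ⊙ (W x) == (diag(l) W) x:

```python
import torch

d_in, d_out = 16, 8
W = torch.randn(d_out, d_in)   # frozen base weight
l = torch.randn(d_out)         # learned IA3 vector (after training, no longer all ones)
x = torch.randn(d_in)

# Merge the learned vector into the weight matrix once, offline.
W_merged = l.unsqueeze(1) * W

out_elementwise = l * (W @ x)   # what IA3 computes at runtime
out_merged = W_merged @ x       # what the merged model computes

assert torch.allclose(out_elementwise, out_merged, atol=1e-6)
```

After merging, inference uses only the single matrix multiplication the original model already performed, which is why the single-task case incurs no extra cost.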