Hi, thank you for your interest in our work and for the questions.
Thanks a lot @Guangxuan-Xiao for your quick response. So for a 7B model it takes 4 hours on one A100 server; that seems quite practical. Sorry for not having looked into the implementation details yet, but for a bigger model, would the gate values stay the same size (like 32 rows)? I wonder whether this solution scales as the model size grows, since the computation for the forward pass would become substantially higher. To make my question concrete, here is a rough sketch of how I understand the gates to scale, assuming one scalar gate per (layer, head) pair; the layer/head counts are taken from the standard Llama-2 configs, not from your repo:
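```python
import torch

# Rough illustration (my assumption, not the repo's code): one trainable
# scalar gate per (layer, head) pair, with the backbone frozen.
configs = {
    "7B":  {"layers": 32, "heads": 32},  # Llama-2-7B config
    "70B": {"layers": 80, "heads": 64},  # Llama-2-70B config
}

for name, cfg in configs.items():
    gates = torch.rand(cfg["layers"], cfg["heads"])  # the only trainable params
    print(f"{name}: gate tensor {tuple(gates.shape)} = {gates.numel()} scalars")
```

If that picture is right, even the 70B model has only 80 × 64 = 5120 gate scalars, so the training cost would be dominated by forward passes through the frozen model rather than by the gates themselves.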
We tested on 70B models. Please take a look at our paper for results.
Thank you. I couldn't find the training time of the gate values for 70B, and I didn't find the 70B gate values in the repo either, but if you needed a whole A100 node to train the gate values with frozen model parameters, I think the 4-hour training time referred to the 70B model. Thanks a lot for pointing that out.
The original finding of attention sinks was a hit, and this new idea, combining attention sinks with retrieval heads, is really cool.
Btw, I noticed the citation on page 7 refers to the original Adam optimizer. However, AdamW was published by Ilya Loshchilov and Prof. Frank Hutter; just a side note in the hope that they can get one more citation from a cool paper. ;)
Thanks! We will add it in the next revision.
Thanks a lot for this promising idea. I wonder how applicable it is during real inference, when you cannot really force users to serve the model with DuoAttention and the model has already been trained. I saw your pretrained attention patterns (full_attention_head.tsv) for the Llama 3 8B and Mistral 7B families. For what it's worth, this is how I imagine those patterns would be consumed at inference time; a minimal sketch, where the .tsv layout (one row per layer, tab-separated per-head scores) and the 0.5 cutoff are my guesses rather than anything documented in the repo:
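```python
import numpy as np

# Load the pretrained head scores. Assumed format: one row per layer,
# tab-separated scores in [0, 1] per head (my assumption about the file).
scores = np.loadtxt("full_attention_head.tsv", delimiter="\t")  # [layers, heads]

threshold = 0.5  # hypothetical cutoff; in practice a sparsity target would set it
full_mask = scores >= threshold  # True -> keep full attention for this head
num_full = int(full_mask.sum())

print(f"{num_full}/{scores.size} heads kept as full attention; "
      f"the rest would use streaming attention (sink + recent tokens)")
```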
Thanks a lot for your time reading my question.