Hi, thank you for your interest in our work and for the questions.
Thanks a lot @Guangxuan-Xiao for your quick response. So for a 7B model it takes 4 hours on one A100 server; that seems quite practical. Sorry for not having looked into the implementation details yet, but for a bigger model, would the gate values stay the same size (like 32 rows)? I wonder whether this solution scales as the model size grows, since the computation for the forward pass would become substantially higher. To make my question concrete, here is a rough sketch of how I understand the gates to scale, assuming one scalar gate per (layer, head) pair; the layer/head counts are taken from the standard Llama-2 configs, not from your repo:
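```python
import torch

# Rough illustration (my assumption, not the repo's code): one trainable
# scalar gate per (layer, head) pair, with the backbone frozen.
configs = {
    "7B":  {"layers": 32, "heads": 32},  # Llama-2-7B config
    "70B": {"layers": 80, "heads": 64},  # Llama-2-70B config
}

for name, cfg in configs.items():
    gates = torch.rand(cfg["layers"], cfg["heads"])  # the only trainable params
    print(f"{name}: gate tensor {tuple(gates.shape)} = {gates.numel()} scalars")
```

If that picture is right, even the 70B model has only 80 × 64 = 5120 gate scalars, so the training cost would be dominated by forward passes through the frozen model rather than by the gates themselves.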
We tested on 70B models. Please take a look at our paper for results.
Thank you. I couldn't find the training time of the gate values for 70B, and I didn't find the 70B gate values in the repo either, but if you needed a whole A100 node to train the gate values with frozen model parameters, I think the 4-hour training time referred to the 70B model. Thanks a lot for pointing that out.
The original finding of attention sinks was a hit, and this new idea, combining attention sinks with retrieval heads, is really cool.
Btw, I noticed the citation on page 7 refers to the original Adam optimizer. However, AdamW was published by Ilya Loshchilov and Prof. Frank Hutter; just a side note in the hope that they can get one more citation from a cool paper. ;)
Thanks! We will add it in the next revision.
Thanks a lot for this promising idea. I wonder how applicable it is during real inference, when you cannot really force users to serve the model with DuoAttention and the model has already been trained. I saw your pretrained attention patterns (full_attention_head.tsv) for the Llama 3 8B and Mistral 7B families. For what it's worth, this is how I imagine those patterns would be consumed at inference time; a minimal sketch, where the .tsv layout (one row per layer, tab-separated per-head scores) and the 0.5 cutoff are my guesses rather than anything documented in the repo:
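```python
import numpy as np

# Load the pretrained head scores. Assumed format: one row per layer,
# tab-separated scores in [0, 1] per head (my assumption about the file).
scores = np.loadtxt("full_attention_head.tsv", delimiter="\t")  # [layers, heads]

threshold = 0.5  # hypothetical cutoff; in practice a sparsity target would set it
full_mask = scores >= threshold  # True -> keep full attention for this head
num_full = int(full_mask.sum())

print(f"{num_full}/{scores.size} heads kept as full attention; "
      f"the rest would use streaming attention (sink + recent tokens)")
```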
Thanks a lot for your time reading my question.