cramraj8 opened this issue 2 months ago
I was trying to use FlashAttention with replace_with_xformers_attention(), but with recent transformers versions I believe LLaMA can use FlashAttention directly by specifying attn_implementation when loading the pretrained model, so this line is not necessary any more.
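For reference, a minimal sketch of loading the model that way, assuming a recent transformers release (the attn_implementation argument only exists in newer versions; older releases used a different flag). The model id here is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; substitute the checkpoint you are actually using.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention kernels need fp16/bf16
    attn_implementation="flash_attention_2",  # let transformers use FlashAttention-2 directly
)
```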
Got it. Thank you.
Hi @MXueguang,
I wonder what the purpose of having replace_with_xformers_attention() defined in utils.py is, because I am getting the following error:
Is the self.num_key_value_heads value used in replace_with_xformers_attention() defined somewhere else?
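For context, in transformers releases that support grouped-query attention, this attribute is set from the model config in LlamaAttention.__init__, so a patched forward that reads it will fail with an AttributeError on older releases. A simplified sketch of where it normally comes from (this is not the utils.py code, just an illustration of the transformers side):

```python
# Simplified from transformers' modeling_llama.py in recent releases,
# shown only to illustrate where self.num_key_value_heads is normally defined.
class LlamaAttentionSketch:
    def __init__(self, config):
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads
        # Added alongside grouped-query attention support; older transformers
        # versions never set this, so any patched forward that references
        # self.num_key_value_heads raises AttributeError there.
        self.num_key_value_heads = config.num_key_value_heads
```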