Closed: gpzlx1 closed this issue 3 months ago.
We provide the same version of the original LlamaForCausalLM in modeling_llama.py for a fair comparison. The latest LlamaForCausalLM is more efficient, likely because the KV-cache code was restructured and flash attention is used.
At present, PyramidInfer can be made compatible with flash attention via flex attention.
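For readers curious what that compatibility could look like, here is a minimal sketch (not the PyramidInfer implementation) of attending over a pruned KV cache with PyTorch's flex_attention. The keep_mask tensor is a hypothetical per-head retention mask standing in for PyramidInfer-style KV selection; PyTorch 2.5+ and a CUDA GPU are assumed, and in practice the call would be wrapped in torch.compile to get fused flash-style kernels.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, Q_LEN, KV_LEN, D = 1, 8, 128, 1024, 64
device = "cuda"

q = torch.randn(B, H, Q_LEN, D, device=device, dtype=torch.float16)
k = torch.randn(B, H, KV_LEN, D, device=device, dtype=torch.float16)
v = torch.randn(B, H, KV_LEN, D, device=device, dtype=torch.float16)

# Hypothetical retention mask: True where a KV entry survives pruning.
keep_mask = torch.rand(B, H, KV_LEN, device=device) > 0.5

def mask_mod(b, h, q_idx, kv_idx):
    # Attend only to KV positions that were kept for this batch element and head.
    return keep_mask[b, h, kv_idx]

block_mask = create_block_mask(mask_mod, B, H, Q_LEN, KV_LEN, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, Q_LEN, D)
```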
Great work!
Currently, I am reproducing this work. I found that the LlamaForCausalLM used in the repository is out of date, and its memory cost is much higher than that of the LlamaForCausalLM from Hugging Face. Here are the results:
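For context, a minimal sketch of one way such a peak-memory comparison could be measured (not the reporter's exact benchmark; the model name, prompt length, and generation length below are placeholder assumptions, and the repository's local modeling_llama.py would be benchmarked with the same routine):

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

def peak_memory_gib(model_name="meta-llama/Llama-2-7b-hf", prompt_len=2048, new_tokens=256):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
    # Random token ids are enough for a memory measurement.
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, prompt_len), device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    return torch.cuda.max_memory_allocated() / 1024**3

print(f"Peak GPU memory: {peak_memory_gib():.2f} GiB")
```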
Based on the information provided, it seems that using the LlamaForCausalLM from the Hugging Face Transformers library (at the specified commit) is more memory-efficient than the version used in the original repository. I'd suggest updating your code to use the Hugging Face version, as it appears to have a lower memory footprint.