mutonix / pyramidinfer


LlamaForCausalLM is out of date #3

Closed gpzlx1 closed 1 month ago

gpzlx1 commented 1 month ago

Great work!

I am currently reproducing this work and found that the LlamaForCausalLM used in this repository is out of date: its memory cost is much higher than that of the LlamaForCausalLM from Hugging Face.

Here are the results:

```
# original model in pyramid
Total Token Num: 14072
Max GPU Memory Per GPU (MB): 32477.294

# original model from hf
# transformers @ git+https://github.com/huggingface/transformers@2e48b3e8725326abd3e9cf82718f7d6debdd8297
Total Token Num: 14072
Max GPU Memory Per GPU (MB): 22554.607
```

Based on these results, the LlamaForCausalLM from the Hugging Face Transformers library (at the commit above) is noticeably more memory-efficient than the version bundled with this repository. I'd suggest updating the code to use the Hugging Face version.
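For anyone trying to reproduce numbers like these, below is a minimal sketch of how peak-memory figures of this kind are typically collected with PyTorch's CUDA memory counters. It is an assumption about the measurement setup, not the script used in this issue; the model name, prompt, and generation settings are placeholders.

```python
# Hedged sketch: reproducing "Max GPU Memory Per GPU (MB)"-style numbers.
# The checkpoint and generation settings are placeholders, not the exact
# configuration used in this issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumption: any Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

prompt = "Long prompt text ..."  # placeholder input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()  # clear the peak-memory counter
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)

total_tokens = out.shape[-1]
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"Total Token Num: {total_tokens}")
print(f"Max GPU Memory Per GPU (MB): {peak_mb:.3f}")
```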

mutonix commented 1 month ago

We provide the same version of the original LlamaForCausalLM in modeling_llama.py for a fair comparison. The latest LlamaForCausalLM is likely more efficient because they have restructured the KV cache code and use flash attention.
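As a rough, hedged sanity check (assuming a 7B-class Llama with 32 attention heads and fp16 attention scores, neither of which is stated in this issue), the attention-score matrix that the older eager implementation materializes for a single layer during a 14,072-token prefill is already on the order of the roughly 10 GB gap reported above, and flash attention avoids it by never forming that matrix:

```python
# Back-of-the-envelope estimate only; 32 heads and fp16 are assumptions.
seq_len = 14072
num_heads = 32                 # assumed Llama-7B-style config
bytes_per_elem = 2             # fp16
one_layer_scores = num_heads * seq_len ** 2 * bytes_per_elem
print(f"{one_layer_scores / 1024**3:.1f} GiB")  # ~11.8 GiB of scores for a single layer
```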

At present, PyramidInfer does not support flash_attn because flash_attn does not expose an API for obtaining the attention weights. We will experiment with the latest Hugging Face or vLLM versions later, which may take a while.
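For anyone experimenting with a recent transformers release in the meantime, the attention weights that attention-based KV pruning relies on are only returned by the eager attention path; the flash-attention kernels never materialize the attention matrix. The sketch below is illustrative only (the checkpoint is a placeholder, and PyramidInfer itself ships its own modeling_llama.py rather than using this path):

```python
# Hedged sketch: getting per-layer attention weights with recent
# transformers (>= 4.36). With attn_implementation="flash_attention_2",
# these weights are not available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumption: any Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # eager path can return attention weights
    device_map="cuda",
)

inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Tuple with one tensor per layer, shape (batch, num_heads, q_len, kv_len)
print(len(out.attentions), out.attentions[0].shape)
```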