Closed: gpzlx1 closed this issue 3 months ago.
We provide the same version of the original LlamaForCausalLM in modeling_llama.py for a fair comparison. The latest LlamaForCausalLM is more efficient, likely because the KV-cache code was restructured and flash attention is used.
At present, PyramidInfer can be made compatible with flash attention via flex attention.
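For readers curious what that compatibility could look like, here is a minimal sketch (not the PyramidInfer implementation) of attending over a pruned KV cache with PyTorch's flex_attention. The keep_mask tensor is a hypothetical per-head retention mask standing in for PyramidInfer-style KV selection; PyTorch 2.5+ and a CUDA GPU are assumed, and in practice the call would be wrapped in torch.compile to get fused flash-style kernels.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, Q_LEN, KV_LEN, D = 1, 8, 128, 1024, 64
device = "cuda"

q = torch.randn(B, H, Q_LEN, D, device=device, dtype=torch.float16)
k = torch.randn(B, H, KV_LEN, D, device=device, dtype=torch.float16)
v = torch.randn(B, H, KV_LEN, D, device=device, dtype=torch.float16)

# Hypothetical retention mask: True where a KV entry survives pruning.
keep_mask = torch.rand(B, H, KV_LEN, device=device) > 0.5

def mask_mod(b, h, q_idx, kv_idx):
    # Attend only to KV positions that were kept for this batch element and head.
    return keep_mask[b, h, kv_idx]

block_mask = create_block_mask(mask_mod, B, H, Q_LEN, KV_LEN, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, Q_LEN, D)
```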
Great work!
Currently, I am reproducing this work. I found that the LlamaForCausalLM used in the repository is out of date, and its memory cost is much higher than that of the LlamaForCausalLM from Hugging Face. Here are the results:
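For context, a minimal sketch of one way such a peak-memory comparison could be measured (not the reporter's exact benchmark; the model name, prompt length, and generation length below are placeholder assumptions, and the repository's local modeling_llama.py would be benchmarked with the same routine):

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

def peak_memory_gib(model_name="meta-llama/Llama-2-7b-hf", prompt_len=2048, new_tokens=256):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
    # Random token ids are enough for a memory measurement.
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, prompt_len), device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    return torch.cuda.max_memory_allocated() / 1024**3

print(f"Peak GPU memory: {peak_memory_gib():.2f} GiB")
```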
Based on the information provided, it seems that using the LlamaForCausalLM from the Hugging Face Transformers library (at the specified commit) is more memory-efficient than the version used in the original repository. I'd suggest updating your code to use the Hugging Face version, as it appears to have a lower memory footprint.