mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

Out of memory in Jetson Orin NX 8GB #165

Open pprp opened 3 months ago

pprp commented 3 months ago

Thank you for your great work! I am trying to deploy the llama-2-7b-chat model on a Jetson Orin NX 8GB.

I followed the instructions in TinyChat, but when loading llama-2-7b-chat 4-bit g128, the process was killed due to an out-of-memory error.

Then I followed the demo on the NVIDIA Jetson AI Lab website (https://www.jetson-ai-lab.com/tutorial_text-generation.html) and downloaded Llama-2-7B-Chat-GPTQ (https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ). It worked well at first, but was still killed after several conversation turns.

As mentioned in the README, llama-2-7b-chat should work on a Jetson Orin NX 8GB. Is there anything I missed? Any thoughts?
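For context on why 8GB is tight, here is a back-of-envelope estimate of the weight memory for a 4-bit, group-size-128 quantized 7B model. This is a sketch with assumed numbers (7B parameters, one fp16 scale and one fp16 zero-point per group of 128 weights), not a measurement of TinyChat's actual allocator behavior:

```python
# Rough memory estimate for AWQ-style 4-bit, g128 quantization of a ~7B model.
# Assumed layout: packed 4-bit weights + fp16 scale and fp16 zero per group.
params = 7e9          # ~7B parameters in llama-2-7b (assumed round number)
bits = 4              # 4-bit quantized weights
group_size = 128      # g128: one scale/zero pair per 128 weights

weight_bytes = params * bits / 8                  # packed 4-bit weights
meta_bytes = (params / group_size) * (2 + 2)      # fp16 scale + fp16 zero per group
total_gib = (weight_bytes + meta_bytes) / 2**30

print(f"~{total_gib:.2f} GiB for quantized weights alone")
```

Weights alone land around 3.5 GiB, and on Jetson the GPU shares that 8GB with the OS, the CUDA context, activations, and a KV cache that grows with every conversation turn, which is consistent with the process surviving the initial load but getting killed a few turns in.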