mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

Out of memory in Jetson Orin NX 8GB #165

Open pprp opened 3 months ago

pprp commented 3 months ago

Thank you for your great work! I am trying to deploy the llama-2-7b-chat model on a Jetson Orin NX 8GB.

I followed the instructions in TinyChat, but when loading llama-2-7b-chat 4-bit g128, the process was killed due to an out-of-memory error.

Then I followed the demo on the NVIDIA Jetson AI Lab website (https://www.jetson-ai-lab.com/tutorial_text-generation.html) and downloaded Llama-2-7B-Chat-GPTQ (https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ). It worked well at first, but was still killed after several conversation turns.

As mentioned in the README, llama-2-7b-chat should work on a Jetson Orin NX 8GB. Is there anything I missed? Any thoughts?
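For context on why 8GB is tight, here is a back-of-envelope estimate of the weight memory for a 4-bit, group-size-128 quantized 7B model. This is a sketch with assumed numbers (7B parameters, one fp16 scale and one fp16 zero-point per group of 128 weights), not a measurement of TinyChat's actual allocator behavior:

```python
# Rough memory estimate for AWQ-style 4-bit, g128 quantization of a ~7B model.
# Assumed layout: packed 4-bit weights + fp16 scale and fp16 zero per group.
params = 7e9          # ~7B parameters in llama-2-7b (assumed round number)
bits = 4              # 4-bit quantized weights
group_size = 128      # g128: one scale/zero pair per 128 weights

weight_bytes = params * bits / 8                  # packed 4-bit weights
meta_bytes = (params / group_size) * (2 + 2)      # fp16 scale + fp16 zero per group
total_gib = (weight_bytes + meta_bytes) / 2**30

print(f"~{total_gib:.2f} GiB for quantized weights alone")
```

Weights alone land around 3.5 GiB, and on Jetson the GPU shares that 8GB with the OS, the CUDA context, activations, and a KV cache that grows with every conversation turn, which is consistent with the process surviving the initial load but getting killed a few turns in.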