mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library
https://mit-han-lab.github.io/TinyChatEngine/
MIT License

Buffer overflow with Llama 3 8B #109

Open renepeinl opened 3 weeks ago

renepeinl commented 3 weeks ago

I tested on Ubuntu 24.04 LTS on two different PCs, both with 16 GB of main memory and no dedicated GPU, so I run all models solely on the CPU. I was able to run Mistral 7B (AWQ Int4) together with Whisper small and Piper TTS without any problems. However, when trying to run Llama 3 8B (AWQ Int4), the model loads but triggers a buffer overflow as soon as I issue the first query, even without ASR and TTS running in parallel. I checked main memory usage with top, but RAM does not appear to be full. Any suggestions on how to get Llama 3 running with this hardware configuration? Are there any plans to support Phi 3 any time soon?
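For reference, here is the back-of-envelope arithmetic suggesting 16 GB should be plenty for the weights alone. This is a rough sketch with assumed numbers (8B/7B parameter counts, a typical AWQ group size of 128 with fp16 scale and zero-point per group), not measured values from TinyChatEngine:

```python
# Rough estimate of weight memory for an AWQ Int4 quantized LLM.
# Assumptions (not from TinyChatEngine): group_size=128, with one fp16
# scale and one fp16 zero-point (4 bytes total) per quantization group.

def int4_weight_bytes(n_params: int, group_size: int = 128) -> int:
    """Approximate bytes for 4-bit weights plus per-group quantization metadata."""
    weight = n_params // 2                   # 4 bits = 0.5 byte per parameter
    overhead = (n_params // group_size) * 4  # fp16 scale + fp16 zero per group
    return weight + overhead

llama3_8b = int4_weight_bytes(8_000_000_000)
mistral_7b = int4_weight_bytes(7_000_000_000)
print(f"Llama 3 8B int4 weights: ~{llama3_8b / 2**30:.1f} GiB")
print(f"Mistral 7B int4 weights: ~{mistral_7b / 2**30:.1f} GiB")
```

Both come out around 3.5-4 GiB, well under 16 GB even with activations and KV cache on top, which matches what top shows: the crash does not look like simple memory exhaustion.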