mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library
https://mit-han-lab.github.io/TinyChatEngine/
MIT License

Buffer overflow with Llama 3 8B #109

Open renepeinl opened 3 weeks ago

renepeinl commented 3 weeks ago

I tested on Ubuntu 24.04 LTS on two different PCs, both with 16 GB of main memory and no dedicated GPU, so I run all models solely on the CPU. I was able to run Mistral 7B (AWQ Int4) together with Whisper small and Piper TTS without any problems. However, when trying to run Llama 3 8B (AWQ Int4), the model loads but triggers a buffer overflow as soon as I issue the first query, even without ASR and TTS running in parallel. I checked main memory usage with top, but RAM does not appear to be full. Any suggestions on how to get Llama 3 running with this hardware configuration? Are there any plans to support Phi 3 any time soon?
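For reference, here is the back-of-envelope arithmetic suggesting 16 GB should be plenty for the weights alone. This is a rough sketch with assumed numbers (8B/7B parameter counts, a typical AWQ group size of 128 with fp16 scale and zero-point per group), not measured values from TinyChatEngine:

```python
# Rough estimate of weight memory for an AWQ Int4 quantized LLM.
# Assumptions (not from TinyChatEngine): group_size=128, with one fp16
# scale and one fp16 zero-point (4 bytes total) per quantization group.

def int4_weight_bytes(n_params: int, group_size: int = 128) -> int:
    """Approximate bytes for 4-bit weights plus per-group quantization metadata."""
    weight = n_params // 2                   # 4 bits = 0.5 byte per parameter
    overhead = (n_params // group_size) * 4  # fp16 scale + fp16 zero per group
    return weight + overhead

llama3_8b = int4_weight_bytes(8_000_000_000)
mistral_7b = int4_weight_bytes(7_000_000_000)
print(f"Llama 3 8B int4 weights: ~{llama3_8b / 2**30:.1f} GiB")
print(f"Mistral 7B int4 weights: ~{mistral_7b / 2**30:.1f} GiB")
```

Both come out around 3.5-4 GiB, well under 16 GB even with activations and KV cache on top, which matches what top shows: the crash does not look like simple memory exhaustion.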