Closed: dog3-l0ver closed this issue 2 months ago
Llama3-8B has a much larger output layer than Llama2-7B, so converting the final part of the model simply requires a lot more VRAM. I'm not really sure there's anything that can be done about it, realistically.
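For scale (back-of-the-envelope figures from the public model configs, not exact conversion numbers): Llama 3 uses a 128,256-token vocabulary versus Llama 2's 32,000, so with hidden size 4,096 the FP16 output projection alone is 128,256 × 4,096 × 2 bytes ≈ 1002 MiB, four times Llama 2's ≈ 250 MiB. That matches the "Tried to allocate 1002.00 MiB" failure in the log below exactly, and the quantizer still needs its working buffers on top of that.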
Oh, that is quite unfortunate. I only recently learned how to quantize using your awesome piece of code so I could stop depending on others for EXL2 quants; it's a shame I'll be back to square one with all the upcoming Llama 3 fine-tunes and merges. I also apologize for my lack of knowledge about the model layers. I'm still very much learning, and intuition, as it often does, has failed me. Thank you for taking the time to respond, and once again for your efforts on this project. Have a good day.
Hello, I hope this is not an oversight on my part, but I am unable to quantize the new Llama-3-8B-Instruct.
I have an 8GB NVIDIA RTX 2070 and 64GB of RAM. I'm on the latest version, 0.0.19, and quantizing the same Llama-3 reupload from NousResearch that you did, with the following command:
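A representative form of that invocation (placeholder paths, not the exact ones used; standard convert.py flags, with 4.25 being the target bits per weight):

```sh
# Representative ExLlamaV2 conversion command; paths are placeholders.
#   -i   input directory with the FP16 HF model
#   -o   working directory (holds the job file, enables resuming)
#   -cf  output directory for the finished quantized model
#   -b   target bits per weight
python convert.py \
    -i /path/to/Meta-Llama-3-8B-Instruct \
    -o /path/to/work_dir \
    -cf /path/to/Llama-3-8B-Instruct-4.25bpw-exl2 \
    -b 4.25
```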
I have previously been successful in quantizing multiple other models of up to 11B parameters (I haven't tried anything larger) using the same command (with different paths, of course) on version 0.0.17. I will try quantizing those models with version 0.0.19 to see whether this is model-specific; meanwhile, this is what I get when resuming quantization after the OOM (it always happens after layer 31, regardless of whether I resume or start from scratch):
Just to be clear, I'm running ExLlamaV2 from within TGW's (text-generation-webui's) Conda environment. I updated the exllamav2 package using the 0.0.19 .whl file from the releases page.
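Roughly like this, from inside the TGW environment (the env name and wheel filename below are illustrative; the real wheel depends on your Python and CUDA versions):

```sh
# Env name and wheel filename are illustrative; pick the wheel from the
# releases page that matches your Python/CUDA versions.
conda activate textgen
pip install ./exllamav2-0.0.19+cu121-cp311-cp311-linux_x86_64.whl
```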
As a sidenote, thank you very much for this amazing work you give us for free. It still stings a little when people discuss having multiple 3090s or 4090s for LLMs, but thanks to you I can at least run a plethora of smaller, but still capable models at blazing fast speeds and without them being noticeably lobotomized haha.
EDIT: No problems quantizing WestLake-10.7B-v2 with version 0.0.19.
EDIT2: Exact same OOM when trying to quantize Llama-3 to 4bpw instead of 4.25bpw.
EDIT3: And again when trying 3bpw. Same "Tried to allocate 1002.00 MiB", and still after layer 31.
EDIT4: Stopped SDDM, killed KWin, and made sure my VRAM was completely empty. Still OOM after layer 31. I refuse to believe that an 8B model around 15GB in size has a single layer bigger than 8GB.
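For reference, a quick way to confirm the card is actually idle before a run (standard nvidia-smi queries):

```sh
# Used vs. total VRAM on the card:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Any processes still holding GPU memory (should come back empty here):
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```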