jbellis closed this issue 8 months ago
I believe this is because in the October build the quantization was happening at runtime. You can do this with the command-line args in run-cli.sh using the `-q Q4` parameter, or you can quantize the model once with the `quantize` command.
Here's how I quantize llama models:

```
./run-cli.sh quantize -q Q4 -s "model.embed_tokens.weight" -s "lm_head.weight" models/Mixtral-8x7B-Instruct-v0.1/
```
This worked in the Oct 15 jlama build. Now it OOMs (note that I have doubled the default Xmx, which was not necessary in October).