mrphrazer / reverser_ai

Provides automated reverse engineering assistance through the use of local large language models (LLMs) on consumer hardware.
GNU General Public License v2.0

Why ollama and Q4 quantization? #7

Open rmusser01 opened 1 month ago

rmusser01 commented 1 month ago

Hi, opening an issue as I can't PM you on Twitter. First: I think the project is really cool. Second: I just larp as a red teamer and recently started working on an LLM project (tldw), so by all means, I am far from knowledgeable.

That being said, why Ollama instead of plain llama.cpp or llamafile? https://github.com/Mozilla-Ocho/llamafile
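For context, by "plain llama.cpp" I mean something like the llama-cpp-python bindings, which load a local GGUF directly without a separate Ollama daemon. Rough sketch of what I'm picturing (the model path and prompt are just placeholders, not anything from your repo):

```python
# Rough sketch using llama-cpp-python; model path and prompt are placeholders.
from llama_cpp import Llama

# Load a local GGUF directly -- no Ollama server process needed.
llm = Llama(
    model_path="./mistral-7b-instruct.Q8_0.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload as many layers as fit on the GPU
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Suggest a better name for the function sub_401000."}]
)
print(resp["choices"][0]["message"]["content"])
```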

Also, why a Q4 quant instead of a Q8 quant? Since you're using a 7B model, quantization has a much larger effect on the model's reasoning capability than it would on, say, a 70B.
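If it helps clarify what I'm suggesting: with the Ollama Python client, swapping quants should just be a matter of pulling a different tag. Quick sketch below; the exact Q8 tag name is a guess on my part, so check the Ollama model library for what's actually published:

```python
# Sketch with the ollama Python client; the q8_0 tag is a guess,
# check the Ollama model library for the exact tag name.
import ollama

MODEL = "mistral:7b-instruct-q8_0"  # roughly twice the download/VRAM of the default Q4 variant

ollama.pull(MODEL)
resp = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize what this decompiled function does: ..."}],
)
print(resp["message"]["content"])
```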

Paper discussing the impact of quantization on Llama 3: https://arxiv.org/abs/2404.14047. Granted, you're using Mistral, but according to your blog post (don't have the URL handy), if I remember correctly it said that you also fine-tuned Mistral at Q4 quantization, and (again, talking out of my ass) I would assume you would run into similar issues at that level of quantization with Mistral as you would with Llama 3.

Pointing this out in case it wasn't known at the time; if it was, maybe you can point me to some newer info. Thanks again for the project, I think it's pretty neat.