khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they are online or in your own notes. Use online AI models (e.g. GPT-4) or private, local LLMs (e.g. Llama 3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, the Desktop app, the Web, or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0

[IDEA] Support other quantizations #653

Closed · harish0201 closed 7 months ago

harish0201 commented 7 months ago

Hi!

Maybe I overlooked the documentation, but is there a way to:

  1. Use quantizations other than Q4? I have the RAM and VRAM, and I'd like better responses.
  2. Use custom endpoints, like llama.cpp's server mode? I can see that GPT4All has just a few models. This ties back into point 1, since then I wouldn't be limited to whatever GPT4All has in its repertoire.
harish0201 commented 7 months ago

Nevermind, for the first one, I got it.

I symlinked the files from my model folder and renamed: mistral-7b-instruct-v0.2.Q5_K_M.gguf to mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf
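In case it's useful to others, the workaround looks roughly like this (a sketch; the paths are examples from my setup, and the GPT4All model directory is an assumption that may differ on your machine):

```sh
# Link the higher-quantization GGUF into the model folder Khoj/GPT4All scans,
# under the filename Khoj expects for this model.
# ~/.cache/gpt4all is an assumed default for the gpt4all Python bindings;
# adjust both paths to your setup.
ln -s ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf \
      ~/.cache/gpt4all/mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf
```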

Still curious about the second one though!

debanjum commented 7 months ago

> Nevermind, for the first one, I got it.
>
> I symlinked the files from my model folder and renamed: mistral-7b-instruct-v0.2.Q5_K_M.gguf to mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf

Nice! So the symlink was to get mistral-7b-instruct-v0.2 with the Q5_K_M quantization working? How's the response quality? Maybe also try some of the other higher-quality Mistral fine-tunes, like OpenChat-0106.

> Use custom endpoints, like llama.cpp's server mode? I can see that GPT4All has just a few models. This ties back into point 1, since then I wouldn't be limited to whatever GPT4All has in its repertoire.
>
> Still curious about the second one though!

Try the docs on setting up an OpenAI-compatible proxy server to use whatever model you want. Let me know if that doesn't work.
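Roughly, the flow would be something like this (a sketch, not a definitive recipe; the llama.cpp binary name and flags, and the exact Khoj settings for pointing at a custom endpoint, depend on your versions, so treat the names below as assumptions):

```sh
# Serve any local GGUF through llama.cpp's OpenAI-compatible HTTP server.
# Model path is an example; check your llama.cpp build's --help for flags.
./server -m ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf \
  --host 127.0.0.1 --port 8080

# Then configure Khoj's OpenAI processor to use the local endpoint.
# Variable names here are illustrative; see the Khoj docs for the exact
# settings in your version.
export OPENAI_API_BASE="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="sk-placeholder"  # llama.cpp's server ignores the key value
```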

PS: Converting this issue into a GitHub discussion for now.