khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they are online or in your own notes. Use online AI models (e.g. GPT-4) or private, local LLMs (e.g. Llama 3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, the Desktop app, the Web, or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0

[IDEA] Support other quantizations #653

Closed · harish0201 closed 7 months ago

harish0201 commented 7 months ago

Hi!

Maybe I overlooked the documentation, but is there a way to:

  1. Use quantizations other than Q4? I have the RAM and VRAM, and I'd like better responses.
  2. Use custom endpoints, like llama.cpp's server mode? I can see that GPT4All has just a few models. This ties back into point 1, since then I wouldn't be limited to whatever GPT4All has in its repertoire.
harish0201 commented 7 months ago

Nevermind, for the first one, I got it.

I symlinked the files from my model folder and renamed: mistral-7b-instruct-v0.2.Q5_K_M.gguf to mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf
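In case it's useful to others, the workaround looks roughly like this (a sketch; the paths are examples from my setup, and the GPT4All model directory is an assumption that may differ on your machine):

```sh
# Link the higher-quantization GGUF into the model folder Khoj/GPT4All scans,
# under the filename Khoj expects for this model.
# ~/.cache/gpt4all is an assumed default for the gpt4all Python bindings;
# adjust both paths to your setup.
ln -s ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf \
      ~/.cache/gpt4all/mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf
```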

Still curious about the second one though!

debanjum commented 7 months ago

> Nevermind, for the first one, I got it.
>
> I symlinked the files from my model folder and renamed: mistral-7b-instruct-v0.2.Q5_K_M.gguf to mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf

Nice! So the symlink was to get mistral-7b-instruct-v0.2 with the Q5_K_M quantization working? How's the response quality? Maybe also try some of the other higher-quality Mistral fine-tunes, like OpenChat-0106.

> Use custom endpoints, like llama.cpp's server mode? I can see that GPT4All has just a few models. This ties back into point 1, since then I wouldn't be limited to whatever GPT4All has in its repertoire.
>
> Still curious about the second one though!

Try the docs on setting up an OpenAI-compatible proxy server to use whatever model you want. Let me know if that doesn't work.
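Roughly, the flow would be something like this (a sketch, not a definitive recipe; the llama.cpp binary name and flags, and the exact Khoj settings for pointing at a custom endpoint, depend on your versions, so treat the names below as assumptions):

```sh
# Serve any local GGUF through llama.cpp's OpenAI-compatible HTTP server.
# Model path is an example; check your llama.cpp build's --help for flags.
./server -m ~/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf \
  --host 127.0.0.1 --port 8080

# Then configure Khoj's OpenAI processor to use the local endpoint.
# Variable names here are illustrative; see the Khoj docs for the exact
# settings in your version.
export OPENAI_API_BASE="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="sk-placeholder"  # llama.cpp's server ignores the key value
```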

PS: Converting this issue into a GitHub discussion for now.