khoj-ai / khoj

Your AI second brain, open and self-hostable. Get answers to your questions, whether from the web or your own notes. Use online AI models (e.g. GPT-4) or private, local LLMs (e.g. Llama 3).
https://khoj.dev
GNU Affero General Public License v3.0

Add support for other LLMs like Anthropic #318

Shekhars closed this issue 2 weeks ago

Shekhars commented 1 year ago

Hi, thanks for this project. Great work! Two questions:

  1. Do you have plans to support LLMs other than OpenAI like Anthropic or Vertex (Google)?
  2. I am trying to set up Khoj for the first time and passed a folder of PDFs to it. I get this error:
    ERROR    No valid entries found in specified files: compressed_jsonl=/Users/ss/.khoj/content/pdf/pdf.jsonl.gz, api.py:409
             embeddings_file=~/.khoj/content/pdf/pdf_embeddings.pt, input_files=None, input_filter=['~/Documents/BANK STATEMENT 2018-2023/**/*.pdf'],
             index_heading_entries=False

    Do I need an OpenAI key for embeddings, or are the embeddings created locally?

sabaimran commented 1 year ago

Hey @Shekhars! Thanks for the issue.

  1. Yes, we want to extend the suite of supported LLMs. We're waiting for access to some of the other foundation models, and in tandem prioritizing locally hosted LLMs. See also #201. I don't think we'll set up a full BYOM (bring your own model) architecture, but we do want to have a couple of well-supported options.
  2. You do not need an OpenAI key for embeddings. They are generated locally using sentence-transformers/multi-qa-MiniLM-L6-cos-v1.

Could you walk me through the steps you took?

I'm guessing you first updated the configuration in the PDF settings page, and then saved it. After that, did you click Configure? Do you see any warnings in your console when you click Configure, or is it straight to the error? I'm able to reproduce that log line if I pass in a filepath to a directory that has no PDFs in it.

I'm not sure about this, but it might also be unhappy with spaces in the folder name. Can you try replacing the spaces with underscores and see if it works? I'll try to repro on my end as well.
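As a quick sanity check on the spaces hypothesis: Python's own `glob` handles spaces in directory names without special treatment, provided `**` patterns are expanded with `recursive=True`. A standalone sketch (not Khoj's actual file-resolution code) mimicking the `input_filter` above:

```python
import glob
import os
import tempfile

# Recreate the input_filter layout from above, spaces and all.
root = tempfile.mkdtemp()
nested = os.path.join(root, "BANK STATEMENT 2018-2023", "2021", "q1")
os.makedirs(nested)
open(os.path.join(nested, "jan.pdf"), "w").close()

# glob expands '**' across directory levels only when recursive=True;
# spaces in the folder name need no escaping or substitution.
pattern = os.path.join(root, "BANK STATEMENT 2018-2023", "**", "*.pdf")
hits = glob.glob(pattern, recursive=True)
print(len(hits))  # -> 1
```

If the underlying resolver behaves like this, spaces alone shouldn't cause the empty result, which would point back at the configured path itself.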

Shekhars commented 1 year ago

@sabaimran Thanks for responding. Looks like you were right; embeddings seem to be working fine. Quick question: it seems embeddings are being appended to a jsonl file. Why not use a lightweight vector DB? SQLite supports vector search, and Chroma could be another good option. Regarding Anthropic, it looks like the PR you mentioned supports local models. Would the same work extend support to Anthropic?

debanjum commented 1 year ago

> Embeddings seem to be working fine.

That's great to hear! Did searching your PDFs with Khoj work well enough for your use-case?

> Quick question: it seems embeddings are being appended to a jsonl file. Why not use a lightweight vector DB? SQLite supports vector search, and Chroma could be another good option.

The indexed text chunks are stored in a jsonl file; the embedding vectors are stored separately in PyTorch format.

Vector DBs weren't worth adding as a dependency for the minor speed benefits they provided us back in 2021. But I do plan to try SQLite (with the vss extension) for the app-logic simplification and memory benefits it can provide when supporting even 10+ GB personal knowledge bases.
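The SQLite direction can be sketched in a few lines. The schema below is hypothetical, not Khoj's; and where the vss extension would run the similarity search inside SQL, this version brute-forces it in Python over float32 BLOBs:

```python
import math
import sqlite3
import struct

# Keep raw text and its embedding vector side by side in one table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, raw TEXT, vec BLOB)")

def pack(v):    # serialize a vector as a float32 BLOB
    return struct.pack(f"{len(v)}f", *v)

def unpack(b):  # inverse: BLOB back to a list of floats
    return list(struct.unpack(f"{len(b) // 4}f", b))

# Toy 3-d vectors standing in for real sentence-transformer embeddings.
rows = [("coffee receipt", [0.1, 0.9, 0.0]), ("train ticket", [0.8, 0.1, 0.2])]
db.executemany("INSERT INTO entries (raw, vec) VALUES (?, ?)",
               [(raw, pack(v)) for raw, v in rows])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [0.1, 0.8, 0.1]  # embedding of the search query
best = max(db.execute("SELECT raw, vec FROM entries"),
           key=lambda row: cosine(query, unpack(row[1])))
print(best[0])  # -> coffee receipt
```

Even this naive scan keeps chunks and vectors in one transactional file, which is much of the app-logic simplification mentioned above.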

What benefits do you see in using a vector DB for Khoj?

> Regarding Anthropic, it looks like the PR you mentioned supports local models. Would the same work extend support to Anthropic?

No, unfortunately the PR to support offline chat models isn't going to enable using Anthropic models. What would you like to use Anthropic for (instead of OpenAI)? Did you try Khoj chat yet? And what (if anything) do you find lacking in its current form?

Shekhars commented 1 year ago

@sabaimran thanks for your response.

Yes, the search worked quite well and was performant enough for my test use case. I am going to plug in a bigger doc library and see how it works.

My actual use case is definitely much bigger than a few gigs. It's a small finance firm, and the office collects tons of docs daily. I am looking to leverage a system like Khoj to do semantic search and possibly some analysis. I can try creating a PR if I can get Chroma/SQLite working with Khoj.

I am insistent on Anthropic for selfish reasons :) API access is free for me for testing right now, and I find their Instant model great for something like this (chatting with a set of docs). It's ridiculously fast and good enough for non-reasoning use cases. Besides, my personal view is that projects like this should leverage as many different hosted LLMs as possible by creating the right abstraction (which I believe you're already building). That way the project isn't entirely dependent on OpenAI and the direction they take. This, of course, is far-fetched.

TheCatster commented 1 year ago

I'm another person who would much prefer Anthropic. My reason is the context window and its ability to understand significantly more at once. Examples include large codebases, legal documents, and other texts where having a massive amount of material to reference is critical.

ishaan-jaff commented 1 year ago

@debanjum @sabaimran Happy to take the lead on this by integrating with https://github.com/BerriAI/litellm. It adds support for Azure, OpenAI, PaLM, Anthropic, and Cohere models.

Let me know if this sounds good

debanjum commented 2 weeks ago

Khoj now supports any OpenAI-compatible proxy server (including LiteLLM and Ollama), any offline chat model in GGUF format, and has first-party support for Anthropic too.
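To illustrate what "OpenAI-compatible" buys you: the same chat-completions request body works against OpenAI, a LiteLLM proxy, or Ollama, with only the base URL (and model name) changing. A stdlib-only sketch that just builds the request (the URL and model name below are illustrative placeholders, and nothing is sent):

```python
import json
from urllib import request

def chat_request(base_url: str, model: str, prompt: str,
                 api_key: str = "unused-for-local") -> request.Request:
    """Build an OpenAI-style /v1/chat/completions request."""
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}]}
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})

# Target a local Ollama server (default port 11434) instead of OpenAI:
req = chat_request("http://localhost:11434", "llama3", "Summarize my notes")
print(req.full_url)  # -> http://localhost:11434/v1/chat/completions
```

Sending it with `urllib.request.urlopen(req)` (server required) returns the familiar OpenAI-style JSON, which is why a single client abstraction covers all of these backends.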