khoj-ai / khoj

Your AI second brain. Self-hostable. Get answers from the web or your docs. Build custom agents, schedule automations, do deep research. Turn any online or local LLM into your personal, autonomous AI (e.g gpt, claude, gemini, llama, qwen, mistral).
https://khoj.dev
GNU Affero General Public License v3.0
14.28k stars 709 forks

[Tech Question] Query biasing problem #439

Closed Gangxin-Li closed 1 year ago

Gangxin-Li commented 1 year ago

Hi AI leaders,

I am wondering how you address the issue of query bias.

From what I understand, we build embeddings over the indexed documents to capture new knowledge. However, when we submit a query, it is fed directly into the same embedding model. How do we determine which fields are searched? And how do you improve the accuracy of the results?
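For context, embedding-based retrieval of this kind can be sketched roughly as below. This is a toy illustration, not Khoj code: the fixed vocabulary and `min_score` threshold are made-up stand-ins for a trained sentence-embedding model and a tuned relevance cutoff.

```python
import math

# Toy vocabulary standing in for a learned embedding space.
VOCAB = ["history", "rome", "empire", "recipe", "bread"]

def embed(text):
    """Stand-in for a real embedding model: normalized bag-of-words
    counts over a fixed toy vocabulary."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query, documents, min_score=0.4):
    """Embed the query, score it against each document embedding with a
    dot product (cosine similarity, since vectors are normalized), and
    keep only results above a relevance threshold."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(d))), d) for d in documents]
    return [(s, d) for s, d in sorted(scored, reverse=True) if s >= min_score]
```

The threshold is one common lever for accuracy: raising `min_score` trades recall for precision, so off-topic queries return nothing instead of weak matches.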

For instance, if I input "history" knowledge into the model but only want to ask a general question rather than a history-specific one, there seems to be no mechanism in place to manage this distinction. My proposal is to build a binary text classification system to handle the query, and to give the model strong instructions when we submit a query.

What are your thoughts on this? Or does the default LLM already handle that?

Many thanks, Gangxin

debanjum commented 1 year ago

Hi Gangxin, please correct me if my understanding is wrong, but it seems like you want Khoj to be able to respond to some questions without referencing your personal knowledge base?

If so, the two techniques Khoj currently uses to answer general questions are:

  1. The chat model will answer your query using its general knowledge if it doesn't find any relevant information in your personal knowledge base
  2. You can prefix your chat message with @general if you want to force the chat model to respond using only its general knowledge. This will prevent it from trying to retrieve any relevant entries from your personal knowledge base. Example query: @general How did the dinosaurs die out? will force the chat model to answer using its general knowledge and not use anything in your knowledge base

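The two behaviours above can be sketched roughly like this. This is a hypothetical simplification: the function names (`route_query`, `search_notes`, etc.) are made up for illustration and do not come from the Khoj codebase.

```python
# Hypothetical sketch of the two fallback behaviours described above.
# None of these names come from the Khoj codebase.

GENERAL_PREFIX = "@general"

def route_query(query, search_notes, answer_with_context, answer_general):
    """Decide whether to answer from the knowledge base or general knowledge."""
    # 2. An explicit @general prefix skips retrieval entirely.
    if query.startswith(GENERAL_PREFIX):
        stripped = query[len(GENERAL_PREFIX):].strip()
        return answer_general(stripped)

    # 1. Otherwise try retrieval, and fall back to general knowledge
    #    when no relevant entries are found.
    entries = search_notes(query)
    if entries:
        return answer_with_context(query, entries)
    return answer_general(query)
```

The point of the sketch is that the prefix check happens before retrieval, so `@general` guarantees the knowledge base is never consulted, whereas the no-results fallback only kicks in after a search comes back empty.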
Gangxin-Li commented 1 year ago

Thank you for your response!!!

  1. I completely understand. In that case, it's simply answering from the model's own knowledge.
  2. That sounds great. Can you provide a link detailing how you implemented that? It's a brilliant idea.

I want to understand how Khoj distinguishes whether a question is related to my personal knowledge or not. In more detail, one could use keyword-based methods or TF-IDF to determine whether new questions are relevant to the knowledge base. Against this backdrop, I've built a binary classification system to detect whether a new question is related or not, but I am not sure whether my binary classifier can beat the LLM's own relevance detection.

Do LLMs already handle this well? I'm uncertain how to specify the fields that the LLM identifies. I have checked several materials on this, but found nothing valuable.
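A minimal version of the TF-IDF relevance gate discussed above could look like the following. This is a rough pure-Python sketch, not Khoj code, and the `threshold` value is an arbitrary placeholder that would need tuning on real data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a small corpus of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_personal_question(query, docs, threshold=0.15):
    """Binary gate: True if the query looks related to the indexed docs."""
    tokenized = [d.lower().split() for d in docs]
    vecs, idf = tfidf_vectors(tokenized)
    q = query.lower().split()
    tf = Counter(q)
    qvec = {t: tf[t] / len(q) * idf.get(t, 0.0) for t in q}
    return max((cosine(qvec, v) for v in vecs), default=0.0) >= threshold
```

Whether such a gate beats the LLM's own judgment is an empirical question: the classifier is cheap and deterministic, but purely lexical, so it will miss paraphrases that an embedding model or the LLM itself would catch.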