khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g. gpt4) or private, local LLMs (e.g. llama3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0

Indexing Job Schedule #520

Open jh0274 opened 11 months ago

jh0274 commented 11 months ago

Hi!

Thank you for the awesome plugin, it's so good.

Can I ask why you decided to have the indexing job run every 60 minutes in Obsidian? The reason I ask is that I wanted to reduce it, but wondered whether there would be any knock-on impacts.

I'm also wondering whether you'd perhaps intended for people to use Khoj alongside other Obsidian search plugins that index much quicker, like Omnisearch?

Thanks again!

debanjum commented 10 months ago

60 minutes is an arbitrary interval for the indexing job. It was just based on the idea that folks may not need to search or chat about stuff they've just written down. It can be reduced to any interval greater than the time it takes Khoj to index your knowledge base, and doing so shouldn't have any other impact.

Changing the indexing job interval isn't trivial at the moment, as we don't expose it as a user setting. So you'll have to modify the source code and build the Khoj plugin locally.
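
If you do want to experiment before a setting is exposed, the change is likely a one-line constant in the plugin's scheduling code. Here is a minimal sketch of how an Obsidian plugin schedules a recurring index job; the names (`KhojSyncSketch`, `updateIndex`) and the 60-minute constant are illustrative assumptions, not the actual Khoj plugin source:

```typescript
import { Plugin } from "obsidian";

// Hypothetical sketch only; the real Khoj plugin may structure this differently.
const INDEX_INTERVAL_MS = 60 * 60 * 1000; // 60 minutes; lower to re-index more often

export default class KhojSyncSketch extends Plugin {
    async onload() {
        // Index once on startup, then re-index on a fixed schedule.
        await this.updateIndex();
        this.registerInterval(
            window.setInterval(() => this.updateIndex(), INDEX_INTERVAL_MS)
        );
    }

    private async updateIndex() {
        // Placeholder: push changed vault files to the Khoj server for indexing.
        console.log("Khoj sketch: updating index...");
    }
}
```

As noted above, keep the interval longer than a full indexing pass so runs don't overlap.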

The reason Khoj takes longer to index on the first run (than, say, Omnisearch) is that it uses a machine learning/AI model to generate the index. This is more compute-intensive than traditional keyword indexing. It shouldn't take much time to update the index on subsequent runs, though, as it only indexes what has changed rather than re-indexing everything from scratch every time.
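
To make the incremental part concrete, here is a rough sketch of hash-based change detection, the general technique for only re-embedding files that changed. It's illustrative, not Khoj's actual indexer, and `embed` is a stand-in for whatever model call generates embeddings:

```typescript
import { createHash } from "crypto";

// Illustrative sketch of incremental indexing, not Khoj's actual implementation.
type EmbedFn = (text: string) => Promise<number[]>;

const indexedHashes = new Map<string, string>(); // file path -> last indexed content hash
const vectorIndex = new Map<string, number[]>(); // file path -> embedding vector

async function updateIndex(files: Map<string, string>, embed: EmbedFn) {
    for (const [path, content] of files) {
        const hash = createHash("sha256").update(content).digest("hex");
        if (indexedHashes.get(path) === hash) continue; // unchanged: skip the expensive model call
        vectorIndex.set(path, await embed(content)); // only new or changed files hit the ML model
        indexedHashes.set(path, hash);
    }
    // Drop entries for files that no longer exist.
    for (const path of vectorIndex.keys()) {
        if (!files.has(path)) {
            vectorIndex.delete(path);
            indexedHashes.delete(path);
        }
    }
}
```

The first run is slow because every file misses the hash check and gets embedded; after that, only edited files pay the model cost.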

jh0274 commented 10 months ago

Thanks! I’ve noticed some strange behaviour on the indexing side where some markdown files are only partially indexed… I’m debugging now and will let you know more tomorrow.


jh0274 commented 10 months ago

@debanjum so there's nothing wrong with the indexing. I just noticed that the chat query "who is around and wants to meetup in November" returns results where periods of time (Q3 2023, late October, next year, etc.) are ranked higher than the result I was looking for, which specifically mentions 'meeting up in November'. This makes sense given the nature of the search.

I've had a quick look at SBert (which I think provides the underlying models?) for a way of configuring the encoding/embedding to better handle these types of searches/queries, but haven't found much. Are you aware of any config changes I can make to help with this?

Thanks again!

debanjum commented 10 months ago

Oh are you using offline chat or OpenAI in Khoj?

The date awareness of the bare SBert search models isn't that great, but we work around that by exposing search query filters like the date filter (https://docs.khoj.dev/#/advanced?id=query-filters). If you use OpenAI (preferably GPT-4) for chat, the model can usually answer such questions better. To keep response latency down, the offline chat doesn't currently use query filters.
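
For reference, a date-filtered search built from the linked query filters doc would look roughly like the string below; treat the exact `dt` syntax as an assumption and check the doc before relying on it:

```typescript
// Illustrative query only; confirm the filter syntax against the query filters doc.
// Constraining results to a date range is the workaround described above for the
// search model's weak date awareness.
const query = 'who is around and wants to meet up dt>="2023-11-01" dt<"2023-12-01"';
```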

jh0274 commented 10 months ago

Thanks, that's helpful. I'm using SBert but will give GPT-4 a go and see if that improves things.
