khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g gpt4) or private, local LLMs (e.g llama3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or Whatsapp.
https://khoj.dev
GNU Affero General Public License v3.0

Backend killed [FIX] #634

Open edbock opened 7 months ago

edbock commented 7 months ago

Describe the bug

Khoj is a great app, and works quite well, but I have issues during indexing. CPU is in constant use which doesn't surprise me, but sometimes it hogs all available CPU and my machine becomes unresponsive for a minute or two. I assume there's some kind of fail-safe that kicks in because the process ends with the message "Killed".

I have not been able to completely index all the files afaik. I assume this may have something to do with the pdf/image indexing functions. Here are the last four lines of terminal output:

[08:03:19 PM] WARNING Because the aspect ratio of the current image exceeds the limit (min_height or width_height_ratio), the program will skip the detection step. main.py:158
[08:06:48 PM] INFO 🔥 Deleted (0, {}) day-old user requests configure.py:346
[08:12:13 PM] WARNING Because the aspect ratio of the current image exceeds the limit (min_height or width_height_ratio), the program will skip the detection step. main.py:158
Killed

To Reproduce

Steps to reproduce the behavior:

khoj --anonymous-mode --disable-chat-on-gpu --verbose

Requires nothing on my part. This happens every time the backend has been running for more than an hour or two.

Additional context

This has happened every single time I run the backend.

debanjum commented 7 months ago

Yeah, most likely this is happening when Khoj is trying to index the image PDFs in your knowledge base and running out of memory/CPU. What are the specifications (i.e. RAM, CPU, VRAM on GPU) of the machine you're running Khoj on?

Can you gradually give it more of your content to sync? E.g. add one directory at a time and restart Khoj to sync that new data. This way, once it has indexed all your data without being killed, it should be easier to sync any updates you add to your knowledge base.
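To plan that gradual sync, it may help to rank the top-level directories by how much PDF data each one holds and add the smallest first. The following is a minimal sketch, not part of Khoj: the function name `pdf_bytes_per_directory` is hypothetical, it only counts files ending in lowercase `.pdf`, and it assumes the knowledge base lives under one root folder.

```python
from pathlib import Path


def pdf_bytes_per_directory(root):
    """Return (directory name, total PDF bytes) for each immediate
    subdirectory of root, smallest first. Hypothetical helper."""
    totals = []
    for child in Path(root).iterdir():
        if not child.is_dir():
            continue  # loose files directly under root are ignored
        # rglob walks nested folders; only files named *.pdf are counted
        size = sum(p.stat().st_size for p in child.rglob("*.pdf") if p.is_file())
        totals.append((child.name, size))
    return sorted(totals, key=lambda item: item[1])
```

The script only produces an ordering. Adding each directory to Khoj and restarting it still happens through Khoj's usual configuration.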

edbock commented 7 months ago

CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
GPU: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
RAM: 16.6 GB

Thank you for the suggestion of trying one directory at a time. I'll give that a go. I wish there was a way to know which directory/file was being processed though, it would make it much easier.

BTW when this happens I usually see some temporary PDFs in my home directory. I'm assuming these would have been cleaned up by the process if it had completed successfully. Maybe this might give me a clue as to where the trouble lies.

debanjum commented 7 months ago

Hey @edbock, were you able to get your PDFs indexed?

Fair point on visibility into which file was last being synced. That would make it easier to work out how to split the indexing of data in such scenarios. Let me see how we can show that.

I'm not very sure, but it does sound like the temp PDFs may be from the process being killed in the middle of indexing. If so, you're correct that they should provide at least some clue to where indexing stopped, until there is a cleaner way to show what is currently being indexed.
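Until such visibility exists, the leftover files can be listed newest first after a crash. This is a rough sketch under assumptions: the helper name `recent_temp_pdfs` is hypothetical, and the `"temp"` name fragment is a guess, since the exact naming of the temporary PDFs is not confirmed in this thread.

```python
from pathlib import Path


def recent_temp_pdfs(directory, marker="temp"):
    """List PDFs in directory whose file name contains marker
    (case-insensitive), most recently modified first.
    The default marker is an assumption, not a documented Khoj naming rule."""
    matches = [
        p for p in Path(directory).glob("*.pdf")
        if p.is_file() and marker.lower() in p.name.lower()
    ]
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)
```

Calling `recent_temp_pdfs(Path.home())` right after a crash and opening the first result may hint at which source document was being processed when the process was killed.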

edbock commented 7 months ago

Thank you very much for following up. Unfortunately I haven't had any time to spend on this lately. I'll report back when I get a chance.

sabaimran commented 5 months ago

Hi @edbock! Just looking for some clarification here.

  1. Does Khoj ever manage to go through and index all of your PDFs? Or does it always fail? I'm wondering whether the issue is a build-up of memory usage or just the batch size we're using to process data.
  2. Which client are you using when indexing? Is it coming from your Obsidian app?
  3. What kind of PDFs are these? Would they have a lot of image data, or would they primarily be textual?

edbock commented 5 months ago

@sabaimran, thank you for your questions. AFAIK so far Khoj has never managed to index all the PDFs. It often leaves 1-5 PDF files with "temp" or something like that as part of the file name. It is entirely possible that it is a memory usage issue.
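One way to tell a gradual memory build-up from a sudden spike is to sample the resident memory of the Khoj process while it indexes. The sketch below is Linux-only and reads `VmRSS` from `/proc/<pid>/status`. The function names are hypothetical, and the PID has to be looked up separately (for example with `pgrep`).

```python
import re
import time


def parse_vmrss_kib(status_text):
    """Extract the resident set size in KiB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kib(handle.read())


def sample_rss(pid, interval_s=5.0, samples=12):
    """Collect up to `samples` RSS readings, one every `interval_s` seconds."""
    readings = []
    for _ in range(samples):
        try:
            readings.append(read_rss_kib(pid))
        except OSError:
            break  # process exited, or /proc is unavailable on this platform
        time.sleep(interval_s)
    return readings
```

Readings that climb steadily across many files would point toward a build-up, while a flat line followed by one large jump would point toward a single file or batch being too large.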

I am using the command-line client. Although I am using the Obsidian interface to communicate with the client, I'm pretty sure it's the client that is causing the issues. Everything works fine until the client starts indexing files, and after a period of time (a few minutes or more), the computer locks up and then Khoj crashes. I don't have a swap file enabled so my suspicion is that Ubuntu kills the process to restore order to the system.
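That suspicion can be checked against the kernel log, since the Linux OOM killer records a "Killed process" line for each process it terminates. The following is a small sketch: the helper name `find_oom_kills` is hypothetical, and the log text is assumed to come from a command such as `journalctl -k` or `dmesg` (which may need elevated privileges on Ubuntu).

```python
import re

# Matches kernel log entries such as "... Killed process 4321 (name) ..."
OOM_PATTERN = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, process name) for each OOM-killer entry found in log_text."""
    return [
        (int(m.group("pid")), m.group("name"))
        for m in OOM_PATTERN.finditer(log_text)
    ]
```

If an entry shows up with a timestamp matching the crash, the "Killed" message in the terminal was most likely the kernel reclaiming memory rather than a fail-safe inside Khoj. The process name logged by the kernel may be the Python interpreter rather than `khoj`.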

These are mostly text PDFs. They do contain images, but none of them are predominantly image-based AFAIK.

I have gone another route for a solution to this issue for myself. However, I would be glad to help with testing this issue if you want. As long as you can give me some specific things to watch for, report on, etc.

debanjum commented 3 months ago

This is a follow-up to #1822

Hey @Openegg15, can you clarify what the link you've shared references? And provide more context to your statement?