khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g. gpt4) or private, local LLMs (e.g. llama3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0
12.63k stars, 640 forks

[IDEA] Add better support for document summarization #787

Closed sabaimran closed 3 months ago

sabaimran commented 3 months ago

Describe the feature you'd like

When a document is uploaded to Khoj, we chunk the file and create Entry objects in the database, each representing a part of the file.

This creates a limitation: the full file is never saved to the database, so if a user asks for any kind of summary of a document, we can't fulfill that request.

To fix this, store the raw file along with the Entry objects. We should add a new data model for the File object that stores the raw text of the file. It should look something like this:

class FileObject(BaseModel):
    file_name = models.CharField...
    raw_text = models.TextField...

And update the Entry object like so:

class Entry(BaseModel):
...
    file_object = models.ForeignKey(FileObject,...)
...

We might run into a size constraint on really large files. In that case, we would want to throw an error. Eventually, we can upload the file to S3 instead of throwing an error.
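
The size check could be sketched roughly as follows. This is only an illustration: `MAX_RAW_TEXT_BYTES`, `FileTooLargeError`, and `validate_raw_text_size` are hypothetical names, and the 10 MB limit is an arbitrary placeholder rather than a value from the Khoj codebase.

```python
# Hypothetical limit; the real threshold would depend on the database and deployment.
MAX_RAW_TEXT_BYTES = 10 * 1024 * 1024


class FileTooLargeError(ValueError):
    """Raised when a file's raw text exceeds the storage limit (hypothetical error type)."""


def validate_raw_text_size(raw_text: str, limit: int = MAX_RAW_TEXT_BYTES) -> int:
    """Return the UTF-8 size of raw_text in bytes, or raise if it exceeds the limit."""
    size = len(raw_text.encode("utf-8"))
    if size > limit:
        raise FileTooLargeError(
            f"Raw text is {size} bytes, which exceeds the limit of {limit} bytes"
        )
    return size
```

Calling this before saving the FileObject would leave a single place to later swap the error for an upload to S3.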

When a file arrives for processing under a filename that already exists for the user, compare the hash of the stored file with the hash of the incoming file. If they differ, process the new file and update the entries as usual, updating the FileObject with the new reference.
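
A minimal sketch of the hash comparison, assuming the stored raw text is available for hashing. `StoredFile`, `text_hash`, and `needs_reprocessing` are hypothetical names used for illustration, and SHA-256 is an assumed choice of hash, not necessarily what Khoj uses.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredFile:
    """Hypothetical stand-in for the proposed FileObject model."""
    file_name: str
    raw_text: str


def text_hash(text: str) -> str:
    """Hash the file contents; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reprocessing(existing: Optional[StoredFile], incoming_text: str) -> bool:
    """Return True if the file is new for the user or its contents have changed."""
    if existing is None:
        return True
    return text_hash(existing.raw_text) != text_hash(incoming_text)
```

In practice the hash of the stored file could be kept in its own column so the raw text does not have to be rehashed on every upload.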

You'll also want to add a conversation command called summarize, which should look up the file names accessible to the given user and find the best match for whatever the user is requesting. It should then get the whole text of that file and send it to the LLM to generate the response.
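
One possible way to pick the best-matching file name is fuzzy string matching with the standard library's `difflib`. The issue does not say how matching should work, so this is an assumption; `find_best_file_match` and the `cutoff` default are hypothetical.

```python
import difflib
from typing import List, Optional


def find_best_file_match(
    query: str, file_names: List[str], cutoff: float = 0.3
) -> Optional[str]:
    """Return the file name most similar to the query, or None if nothing is close enough."""
    # Compare case-insensitively, but return the name in its original casing.
    by_lowercase = {name.lower(): name for name in file_names}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else None
```

The summarize command could call this with the file names accessible to the user, then load the matching file's raw text and pass it to the LLM. Matching on the name alone is the simplest option; an embedding-based lookup would be an alternative.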

Relevant files: