[IDEA] Add better support for document summarization

Describe the feature you'd like

When a document is uploaded to Khoj, we chunk the file and create Entry objects in the database to represent parts of the file.

This creates a limitation: because the full file isn't saved to the database, we can't fulfill a request to summarize a document.

To fix this, store the raw file along with the Entry objects. We should add a new data model for the File object that stores the raw text of the file. It should look something like this:

class FileObject(BaseModel):
    file_name = models.CharField...
    raw_text = models.TextField...

And update the Entry object like so:

class Entry(BaseModel):
...
    file_object = models.ForeignKey(FileObject,...)
...

We might run into a size constraint on really large files. In that case, we would want to throw an error. Eventually, we can upload the file to S3 instead of throwing an error.
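The size check could look roughly like the sketch below. The names `MAX_RAW_TEXT_BYTES`, `FileTooLargeError`, and `validate_raw_text_size` are hypothetical and don't exist in Khoj, and the 10 MiB limit is only a placeholder; the real threshold would depend on the database and deployment.

```python
# Hypothetical limit; pick the real threshold based on the database in use.
MAX_RAW_TEXT_BYTES = 10 * 1024 * 1024  # 10 MiB


class FileTooLargeError(ValueError):
    """Hypothetical error raised when a file's raw text exceeds the limit."""


def validate_raw_text_size(file_name: str, raw_text: str, limit: int = MAX_RAW_TEXT_BYTES) -> int:
    """Return the UTF-8 size of raw_text in bytes, or raise if it exceeds limit."""
    size = len(raw_text.encode("utf-8"))
    if size > limit:
        raise FileTooLargeError(
            f"{file_name} is {size} bytes, which exceeds the {limit} byte limit"
        )
    return size
```

Calling this before creating the FileObject keeps the failure early and explicit. Once S3 upload is available, the same check could decide where the file is stored instead of raising.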

When a file with an existing filename for the user lands for processing, compare the hash of the stored file with the hash of the incoming file. If they differ, process the new file and update the entries as usual, updating the FileObject with the new reference.
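A rough sketch of that comparison, assuming SHA-256 over the raw text (the hash function isn't decided in this issue, and `compute_hash` and `needs_reprocessing` are hypothetical helper names):

```python
import hashlib
from typing import Optional


def compute_hash(raw_text: str) -> str:
    """Return the hex SHA-256 digest of a file's raw text."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()


def needs_reprocessing(stored_text: Optional[str], incoming_text: str) -> bool:
    """Return True if the incoming file is new or differs from the stored one."""
    if stored_text is None:
        # No file with this name exists for the user yet.
        return True
    return compute_hash(stored_text) != compute_hash(incoming_text)
```

In practice it would probably be cheaper to store the hash alongside the file rather than rehash the stored text on every upload.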

You'll also want to add a conversation command called summarize, which should look up the file names accessible to the given user and find the best match for whatever the user is requesting. It should then get the whole text of that file and send it to the LLM for a response.
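One possible shape for the command is sketched below. Everything here is an assumption for illustration: the matching uses `difflib` from the standard library as a simple stand-in (an embedding-based match may work better), `files` is a plain dict of file name to raw text rather than a database query, and `send_to_llm` is a hypothetical callable, not an existing Khoj function.

```python
import difflib
from typing import Callable, Dict, List, Optional


def find_best_file_match(query: str, file_names: List[str]) -> Optional[str]:
    """Return the file name most similar to the query, or None if there are no files."""
    # Compare case-insensitively, but return the original file name.
    lowered = {name.lower(): name for name in file_names}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[matches[0]] if matches else None


def summarize(query: str, files: Dict[str, str], send_to_llm: Callable[[str], str]) -> str:
    """Pick the best matching file for the query and ask the LLM to summarize it."""
    best = find_best_file_match(query, list(files))
    if best is None:
        return "No matching file found."
    prompt = f"Summarize the following document named {best}:\n\n{files[best]}"
    return send_to_llm(prompt)
```

With `cutoff=0.0` the closest file name is always returned, even for a poor match, so a real implementation would likely want a minimum similarity or a confirmation step.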

Relevant files:

markdown_to_entries.py for processing each of the uploaded Markdown files. Each file type or data source we support has its own such file (e.g., pdf_to_entries)
embeddings.py for computing embeddings of the relevant chunks
indexer.py for the API endpoints that read in a new file
models/__init__.py for the database model that describes Entry objects

khoj-ai / khoj

[IDEA] Add better support for document summarization #787