When a document is uploaded to Khoj, we chunk the file and create Entry objects in the database to represent parts of the file.
This creates a limitation, because we don't have the full file saved to the database. If a user requests any kind of summarization option, we aren't able to fulfill that request.
To fix this, store the raw file along with the Entry objects. We should add a new data model for the File object that stores the raw text of the file. It should look something like this:
class FileObject(BaseModel):
    file_name = models.CharField...
    raw_text = models.TextField...
And update the Entry object like so:
class Entry(BaseModel):
    ...
    file_object = models.ForeignKey(FileObject, ...)
    ...
We might run into a size constraint on really large files. In that case, we would want to throw an error. Eventually, we can upload the file to S3 instead of throwing an error.
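The size guard could be as simple as the sketch below; the 10 MB cap, function name, and exception type are all assumptions to be tuned:

```python
MAX_RAW_TEXT_BYTES = 10 * 1024 * 1024  # assumed cap; replace with an S3 upload later


def validate_raw_text_size(raw_text: str) -> None:
    """Raise if a file's raw text is too large to store in the database."""
    size = len(raw_text.encode("utf-8"))
    if size > MAX_RAW_TEXT_BYTES:
        raise ValueError(f"File is {size} bytes, over the {MAX_RAW_TEXT_BYTES} byte limit")
```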
When a file arrives for processing under a filename that already exists for the user, compare the hash of the stored file with the hash of the incoming file. If they differ, process the new file and update the entries as usual, pointing the updated entries at the refreshed FileObject.
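The change check could be sketched as follows, assuming SHA-256 over the raw text is an acceptable fingerprint:

```python
import hashlib


def raw_text_hash(raw_text: str) -> str:
    """Fingerprint a file's contents for change detection."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()


def file_changed(stored_raw_text: str, incoming_raw_text: str) -> bool:
    """True when the incoming upload differs from what is already stored."""
    return raw_text_hash(stored_raw_text) != raw_text_hash(incoming_raw_text)
```

When `file_changed` returns False, the indexing pipeline can skip re-chunking and re-embedding entirely.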
You'll also want to add a conversation command called summarize, which should look up the file names accessible to the given user and find the best match for whatever the user is requesting. It should fetch the full raw text and send it to the LLM for a response.
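Finding the best-matching filename could start from difflib's fuzzy matching, as sketched below. The function name and cutoff are assumptions, and embedding-based matching over file names would work just as well:

```python
import difflib
from typing import List, Optional


def best_matching_file(query: str, user_file_names: List[str]) -> Optional[str]:
    """Return the user's file name that most closely matches the request, if any."""
    lowered = [name.lower() for name in user_file_names]
    matches = difflib.get_close_matches(query.lower(), lowered, n=1, cutoff=0.3)
    if not matches:
        return None
    # Map the lowercased winner back to the original file name.
    for name in user_file_names:
        if name.lower() == matches[0]:
            return name
    return None
```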
Relevant files:
markdown_to_entries.py for processing each of the uploaded markdown files. Each file type or data source we support has its own such file (e.g., pdf_to_entries.py)
embeddings.py for computing embeddings of the relevant chunks
indexer.py for the API endpoints that read in a new file
models/__init__.py for the database model that describes Entry objects