mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

Question about adding and removing PDFs #419

Closed erok415 closed 11 months ago

erok415 commented 11 months ago

Hi there,

Throughout the lifespan of a project, a PDF may need updating or removal from the dataset. When I make such changes and then execute npm run ingest, does the vector database in Pinecone reflect these updates?

Could you provide some guidance on the best practices for data rerunning in this context?

Thanks!

dosubot[bot] commented 11 months ago

🤖

Hello,

The ingest-data.ts script in the gpt4-pdf-chatbot-langchain repository is responsible for loading PDF documents from a specified directory, splitting the text into chunks, creating embeddings for these chunks, and storing these embeddings in a Pinecone vector store. However, it does not provide explicit functionality to handle updates or removals of PDFs from the dataset.

When you run npm run ingest, it loads every PDF in the specified directory and stores the embeddings in Pinecone. It does not remove anything that is already in the index, so the vectors for an updated or removed PDF stay in Pinecone unless you delete them (or delete the whole vector store) and then re-run the ingestion script.
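For the "delete everything and re-ingest" route, here is a minimal sketch of a full reset request. It assumes the index exposes the same /vectors/delete endpoint used in the curl command later in this reply, and that the endpoint accepts deleteAll and namespace fields in the JSON body. The helper name buildDeleteAllRequest and the DeleteRequest type are hypothetical and not part of the repository.

```typescript
// Hypothetical request description; nothing here is taken from ingest-data.ts.
interface DeleteRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) a request that clears every vector in one namespace.
// Assumption: the delete endpoint accepts { deleteAll, namespace } in the body.
function buildDeleteAllRequest(
  indexHost: string,
  apiKey: string,
  namespace: string
): DeleteRequest {
  return {
    url: `https://${indexHost}/vectors/delete`,
    method: "POST",
    headers: { "Api-Key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ deleteAll: true, namespace }),
  };
}

// Possible usage (placeholders as in the curl command, not executed here):
// const req = buildDeleteAllRequest("INDEX-IDSTRING.svc.REGION.pinecone.io", "API_KEY_HERE", "NAMESPACE");
// await fetch(req.url, { method: req.method, headers: req.headers, body: req.body });
```

After a reset like this, running npm run ingest repopulates the namespace from whatever PDFs are currently in the directory. This is simple but re-embeds every document, so it is probably only worth it for small datasets.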

To handle updates or removals of PDFs, you might consider the following steps:

  1. Track Changes: Keep track of which PDFs have been updated or removed since the last ingestion. This could be done by maintaining a log or database of changes.

  2. Delete Old Vectors: Before re-ingesting, delete the vectors corresponding to the updated or removed PDFs from the Pinecone vector store. This can be done by sending a POST request to the index's /vectors/delete endpoint with a metadata filter on the source filename of the document you want to delete. Here is an example command:

curl -i -X POST 'https://INDEX-IDSTRING.svc.REGION.pinecone.io/vectors/delete?namespace=NAMESPACE' \
  -H 'Api-Key: API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
    "filter": {"source": {"$in": ["/Users/username/src/gpt4-pdf-chatbot-langchain/docs/ALL.pdf"]}}
  }'

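Replace the placeholders with your actual values. The source value has to match the path stored in the vector metadata at ingest time exactly (an absolute path in this example). You can find more details in this issue.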

  3. Re-Ingest: Run the ingestion script again to ingest the updated PDFs. This will create new vectors for the updated text and store them in Pinecone.
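The three steps above can be sketched roughly as follows. This is an illustration only, assuming you keep a small manifest that maps each PDF's source path to a hash of its contents. The Manifest and ChangeSet types and the helpers hashContent, diffManifests, and buildDeleteBody are hypothetical names, not part of the repository. The filter body has the same shape as the one in the curl command.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest: source path -> SHA-256 of the file contents.
type Manifest = Record<string, string>;

interface ChangeSet {
  stale: string[];    // updated or removed PDFs whose old vectors should be deleted
  toIngest: string[]; // new or updated PDFs that need (re-)ingesting
}

// Hash contents so an edited PDF is detected even when its filename is unchanged.
function hashContent(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Step 1: compare the manifest from the last ingestion with the current one.
function diffManifests(previous: Manifest, current: Manifest): ChangeSet {
  const stale: string[] = [];
  const toIngest: string[] = [];
  for (const [source, hash] of Object.entries(previous)) {
    // Missing from current (removed) or different hash (updated).
    if (current[source] !== hash) stale.push(source);
  }
  for (const [source, hash] of Object.entries(current)) {
    // Missing from previous (new) or different hash (updated).
    if (previous[source] !== hash) toIngest.push(source);
  }
  return { stale: stale.sort(), toIngest: toIngest.sort() };
}

// Step 2: JSON body for the delete call, filtering on the `source` metadata key.
function buildDeleteBody(sources: string[]): string {
  return JSON.stringify({ filter: { source: { $in: sources } } });
}

// Example with made-up paths and contents.
const previous: Manifest = {
  "docs/a.pdf": hashContent("a, version 1"),
  "docs/b.pdf": hashContent("b"),
};
const current: Manifest = {
  "docs/a.pdf": hashContent("a, version 2"),
  "docs/c.pdf": hashContent("c"),
};
const changes = diffManifests(previous, current);
// changes.stale    -> ["docs/a.pdf", "docs/b.pdf"]
// changes.toIngest -> ["docs/a.pdf", "docs/c.pdf"]
const deleteBody = buildDeleteBody(changes.stale);
```

Step 3 would then run the ingestion for the files in toIngest only. As written, ingest-data.ts loads the whole directory, so restricting it to specific files would be a change you make yourself.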

Please note that these are general suggestions and might need to be adapted to your specific use case. You can find more details about the ingestion process in the ingest-data.ts script here.

I hope this helps! If you have any other questions, feel free to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


erok415 commented 11 months ago

That is a great answer and super thorough. I ❤️ bots!