raghavan / PdfGptIndexer

A RAG-based tool for indexing and searching PDF text data using the OpenAI API and a FAISS (Facebook AI Similarity Search) index, designed for rapid information retrieval and superior search accuracy.
MIT License

PdfGptIndexer

Description

PdfGptIndexer is an efficient tool for indexing and searching PDF text data using OpenAI APIs and FAISS (Facebook AI Similarity Search). This software is designed for rapid information retrieval and superior search accuracy.

Libraries Used

  1. Textract - A Python library for extracting text from many document formats, including PDF.
  2. Transformers - A library by Hugging Face providing state-of-the-art general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
  3. LangChain - A framework for building applications on top of language models; used here to compute the embeddings.
  4. FAISS (Facebook AI Similarity Search) - A library for efficient similarity search and clustering of dense vectors.

Installing Dependencies

You can install all dependencies by running the following command:

pip install langchain openai textract transformers faiss-cpu pypdf tiktoken

How It Works

PdfGptIndexer operates in several stages:

  1. It first processes a specified folder of PDF documents, extracting the text and splitting it into manageable chunks using the Hugging Face Transformers library.
  2. Each text chunk is then embedded using the default OpenAI embedding model (text-embedding-ada-002) through the LangChain library.
  3. These embeddings are stored in a FAISS index, providing a compact and efficient storage method.
  4. Finally, a query interface allows you to retrieve relevant information from the indexed data by asking questions. The application fetches and displays the most relevant text chunk.
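The stages above can be illustrated with a minimal, standard-library-only sketch. It is not the project's code: `split_into_chunks`, `toy_embed`, and `nearest_chunk` are hypothetical names, whitespace-separated words stand in for tokenizer tokens, a hashed bag-of-words vector stands in for the OpenAI embedding, and a brute-force L2 search stands in for the FAISS index.

```python
import hashlib
import math


def split_into_chunks(text, chunk_size=8, overlap=2):
    """Split text into overlapping chunks of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Assumption: a stand-in for a real embedding model.

    Builds a hashed bag-of-words vector and normalises it to unit length.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def nearest_chunk(query, chunks, vectors):
    """Exact (brute-force) nearest neighbour by squared L2 distance."""
    q = toy_embed(query)

    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(q, v))

    best = min(range(len(chunks)), key=lambda i: dist(vectors[i]))
    return chunks[best]


text = "FAISS stores dense vectors. Invoices are due within thirty days of receipt."
chunks = split_into_chunks(text, chunk_size=6, overlap=2)
vectors = [toy_embed(c) for c in chunks]   # computed once, at indexing time
print(nearest_chunk("when are invoices due", chunks, vectors))
```

The overlap between neighbouring chunks is a common choice so that a sentence cut at a chunk boundary still appears whole in at least one chunk; the sizes used here are illustrative only.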


Advantages of Storing Embeddings Locally

Storing embeddings locally provides several advantages:

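  1. Speed: Once the embeddings are stored, retrieval is significantly faster because the document embeddings do not have to be recomputed at query time.
  2. Offline access: After the initial embedding creation, the stored index can be loaded without contacting the API, although embedding a new query with the OpenAI API still requires a connection.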
  3. Compute Savings: You only need to compute the embeddings once and reuse them, saving computational resources.
  4. Scalability: This makes it feasible to work with large datasets that would be otherwise difficult to process in real-time.
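The compute-once, reuse-later idea can be sketched as follows. This is an illustration under assumptions rather than the project's implementation: `load_or_build_index` is a hypothetical helper, a JSON file stands in for a persisted FAISS index, and `fake_embed` stands in for a call to an embedding API.

```python
import json
import os
import tempfile


def load_or_build_index(path, chunks, embed_fn):
    """Return (chunks, vectors), calling embed_fn only if no cache file exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
        return data["chunks"], data["vectors"]
    vectors = [embed_fn(c) for c in chunks]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"chunks": chunks, "vectors": vectors}, fh)
    return chunks, vectors


calls = []  # records every embedding request so the saving is visible


def fake_embed(text):
    """Assumption: placeholder for a real (paid, networked) embedding call."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache_path = os.path.join(tempfile.mkdtemp(), "index.json")
docs = ["first chunk", "second chunk"]
first = load_or_build_index(cache_path, docs, fake_embed)   # computes and saves
second = load_or_build_index(cache_path, docs, fake_embed)  # loads from disk
print(len(calls))
```

Note that this sketch never invalidates the cache: if the source documents change, the file at `path` has to be deleted or rebuilt.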

Running the Program

To run the program, you should:

  1. Make sure you have installed all dependencies.
  2. Clone the repository to your local machine.
  3. Navigate to the directory containing the Python script.
  4. Replace the empty API key placeholder in the script with your actual OpenAI API key.
  5. Finally, run the script with Python.
    python3 pdf_gpt_indexer.py

Please ensure that the folders specified in the script for PDF documents and the output text files exist and are accessible. The query interface will start after the embeddings are computed and stored. You can exit the query interface by typing 'exit'.
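A query loop that ends when the user types 'exit' could look roughly like the sketch below. The function name `run_query_loop` and its parameters are hypothetical and not taken from the script; `answer_fn` stands in for whatever looks up the most relevant chunk. The input and output functions are passed in so the loop can be exercised without a terminal.

```python
def run_query_loop(answer_fn, read_line=input, write_line=print):
    """Read questions until the user types 'exit'; return how many were answered."""
    answered = 0
    while True:
        try:
            question = read_line("Ask a question (or 'exit'): ").strip()
        except EOFError:
            break  # input stream closed, treat like 'exit'
        if question.lower() == "exit":
            break
        if not question:
            continue  # ignore empty lines
        write_line(answer_fn(question))
        answered += 1
    return answered


# Scripted demonstration: no keyboard input is needed.
scripted = iter(["What is FAISS?", "", "exit"])
outputs = []
count = run_query_loop(
    answer_fn=lambda q: "echo: " + q,
    read_line=lambda prompt: next(scripted),
    write_line=outputs.append,
)
print(count, outputs)
```

In interactive use, calling `run_query_loop(answer_fn)` with the defaults reads from the keyboard and prints each answer.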

Exploring Custom Data with ChatGPT

Check out the post here for a comprehensive guide on how to utilize ChatGPT with your own custom data.