mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

Explain data ingestion code. #451

Closed araza110 closed 2 months ago

araza110 commented 5 months ago

I'm seeking assistance in understanding a code implementation within my NestJS application. The API is designed to convert PDF documents into vectors and subsequently store them within a Pinecone database. However, upon data ingestion, I'm encountering a discrepancy in the number of vectors stored compared to the expected output from the original code.

To illustrate, executing "yarn ingest" within the original code yields 972 vectors. Yet, when attempting the same process with my code, utilizing an identical file, the resulting number of vectors is 950. Why is that so?! What am I doing wrong?

Please advise on potential factors contributing to this discrepancy and any corrective measures I can implement to achieve consistency.

My NestJS code:

import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { BadRequestException, HttpException, Logger } from '@nestjs/common';
import { HttpStatusCode } from 'axios'; // or wherever HttpStatusCode comes from in your project

  // Method of a NestJS service class; `this.pinecone` (the Pinecone client)
  // and `this.loadPDF()` are defined elsewhere in the class.
  /**
   * @description Linked to the S3 file-upload API. Creates a vector store from the uploaded PDF file.
   * @param {string} url S3 URL of the uploaded PDF file
   * @param {string} namespace Name of the vector store namespace (unique)
   */
  async create(url: string, namespace: string): Promise<void> {
    if (!url) throw new BadRequestException('URL is required');
    if (!namespace) throw new BadRequestException('Vector Name is required');

    try {
      const rawDocs = await this.loadPDF(url as string);

      const textSplitter = new RecursiveCharacterTextSplitter({
        chunkSize: 1000,
        chunkOverlap: 200,
      });

      const docs = await textSplitter.splitDocuments(rawDocs);

      console.log('Creating vector store...');

      /*create and store the embeddings in the vectorStore*/
      const embeddings = new OpenAIEmbeddings({
        openAIApiKey: process.env.OPEN_AI_SK as string,
      });

      const index = this.pinecone.Index(
        process.env.PINECONE_INDEX_NAME as string,
      ); 

      await PineconeStore.fromDocuments(docs, embeddings, {
        pineconeIndex: index,
        namespace,
        textKey: 'text',
      });
    } catch (err) {
      const errorMessage =
        err.response?.data?.error?.message ?? "Couldn't create vector store";

      Logger.error(errorMessage, err.stack, 'LangchainService');

      throw new HttpException(
        {
          status: HttpStatusCode.InternalServerError,
          error: errorMessage,
        },
        HttpStatusCode.InternalServerError,
      );
    }
  }
dosubot[bot] commented 5 months ago

šŸ¤–

Hello @araza110! šŸš€

I'm Dosu, your friendly neighborhood bot. While human maintainers are busy saving the world, I'm here to assist you with bugs, answer your questions, and help you become a stellar contributor. Let's get those vector counts aligning, shall we? šŸ˜‰šŸ‘¾

The discrepancy in the number of vectors between your NestJS implementation and the original code could be due to several factors. Here are a few possibilities:

  1. Different Text Splitting Parameters: The RecursiveCharacterTextSplitter breaks the text into chunks. Your code uses chunkSize: 1000 and chunkOverlap: 200; if the original ingest script uses different values, the chunk count (and therefore the vector count) will differ.

  2. Different Document Loaders: How the PDF is loaded matters as well. Loading one document per page versus one document per file changes where the splitter's chunk boundaries fall, and therefore how many chunks you get (a quick way to check this is sketched below).

  3. Different Embedding Method: OpenAIEmbeddings turns each chunk into one vector, so a different way of creating embeddings would change the vectors' values, though by itself it should not change how many vectors are stored.
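
For factor 2 in particular, a quick way to check is to compare how many raw documents each loader produces before splitting, since splitDocuments splits each document independently. A minimal sketch, assuming LangChain's PDFLoader (the loader choice and the compareChunkCounts helper are illustrative, not necessarily what either codebase actually uses):

import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

async function compareChunkCounts(path: string) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  // PDFLoader defaults to one Document per page...
  const perPage = await new PDFLoader(path).load();
  // ...while splitPages: false yields the whole PDF as a single Document.
  const wholeFile = await new PDFLoader(path, { splitPages: false }).load();

  console.log('raw docs (per page):  ', perPage.length);
  console.log('raw docs (whole file):', wholeFile.length);
  console.log('chunks (per page):    ', (await splitter.splitDocuments(perPage)).length);
  console.log('chunks (whole file):  ', (await splitter.splitDocuments(wholeFile)).length);
}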

Without more information, it's impossible to provide a more specific answer. Could you please provide the original code and the specific parameters used there? This would help in diagnosing the issue more accurately.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

araza110 commented 5 months ago

Hi @dosu-bot, please explain this code block:

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

Also, what is textKey?

await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  namespace,
  textKey: 'text',
});
jkobb510 commented 5 months ago

@araza110 textSplitter splits the PDF documents into smaller parts called "chunks". chunkSize caps each chunk at a maximum of 1000 characters; chunkOverlap controls how many characters adjacent chunks share (200 here).

Say we load a document containing:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vitae mauris nec nisl imperdiet placerat. Nullam in semper velit. Phasellus congue varius felis, in efficitur tellus tempor sit amet. Duis mollis tincidunt ex, ac feugiat metus consectetur ut. Sed tincidunt lacus sed felis fringilla, eu blandit massa ullamcorper. Aliquam pulvinar, nisl eu consectetur dignissim, mi erat fermentum sapien, eu ultrices turpis lectus a nunc."

// directoryLoader stands for the PDF loader used in the original ingest script.
const rawDocs = await directoryLoader.load();

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const docs = await textSplitter.splitDocuments(rawDocs);
console.log('split docs', docs);

With these settings the sample above would actually stay in a single chunk (it is well under 1000 characters); with a smaller chunkSize it would be split into something like:

[
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vitae mauris nec nisl imperdiet placerat. Nullam in semper velit. Phasellus congue varius felis, in efficitur tellus tempor sit amet. Duis mollis tincidunt ex, ",
  "ac feugiat metus consectetur ut. Sed tincidunt lacus sed felis fringilla, eu blandit massa ullamcorper. Aliquam pulvinar, nisl eu consectetur dignissim, mi erat fermentum sapien, eu ultrices turpis lectus a nunc."
]

For:

await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  namespace,
  textKey: 'text',
});

textKey names the metadata field under which each chunk's text is stored in Pinecone. When the chunks are upserted, each Document's pageContent is written into the record's metadata under that key (here 'text'), and when you later query the index with the same textKey it is read back to rebuild each Document's pageContent. So every stored vector carries its text with it, roughly like this:

[
  { text: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vitae mauris nec nisl imperdiet placerat. Nullam in semper velit. Phasellus congue varius felis, in efficitur tellus tempor sit amet. Duis mollis tincidunt ex, " },
  { text: "ac feugiat metus consectetur ut. Sed tincidunt lacus sed felis fringilla, eu blandit massa ullamcorper. Aliquam pulvinar, nisl eu consectetur dignissim, mi erat fermentum sapien, eu ultrices turpis lectus a nunc." }
]
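
To see the other side of textKey: when you later read from the index, you pass the same textKey so the stored text is mapped back onto each Document's pageContent. A minimal retrieval sketch, assuming the same index handle, namespace, and environment variables as in the ingestion code above (the search wrapper and the k value of 4 are illustrative):

import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';

// `pineconeIndex` is the same handle as `this.pinecone.Index(...)` above.
async function search(pineconeIndex: any, namespace: string, query: string) {
  const embeddings = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPEN_AI_SK as string,
  });

  const store = await PineconeStore.fromExistingIndex(embeddings, {
    pineconeIndex,
    namespace,
    textKey: 'text', // must match the textKey used at ingestion time
  });

  // Each returned Document's pageContent is rebuilt from metadata['text'].
  return store.similaritySearch(query, 4);
}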
dosubot[bot] commented 2 months ago

Hi, @araza110,

I'm helping the gpt4-pdf-chatbot-langchain team manage their backlog and am marking this issue as stale. The issue involves a discrepancy in the number of vectors stored when ingesting PDF documents into a Pinecone database using a NestJS application. It seems that the discrepancy has been resolved, with potential factors contributing to the issue identified and addressed. Additionally, a detailed explanation of the textSplitter and textKey parameters was provided to help achieve consistency in the number of vectors generated.

Could you please confirm if this issue is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository? If it is, please let the gpt4-pdf-chatbot-langchain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!