mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

Fail to ingest data #468

Open fraction94 opened 1 month ago

fraction94 commented 1 month ago

I have followed the README instructions and created an index in Pinecone. However, ingestion still fails and I keep getting this error. I don't know what I am doing wrong. I am fairly new to this and trying to learn along the way. Any help is much appreciated.

creating vector store...
error TypeError: Cannot read properties of undefined (reading 'text')
    at C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:44:57
    at step (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:33:23)
    at Object.next (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:14:53)
    at C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:8:71
    at new Promise ()
    at __awaiter (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:4:12)
    at extractMessage (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:40:48)
    at C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\handling.js:66:70
    at step (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\handling.js:33:23)
    at Object.next (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\handling.js:14:53)

file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:46
    throw new Error('Failed to ingest your data');
    ^
Error: Failed to ingest your data
    at run (file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:46:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:51:3
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

dosubot[bot] commented 1 month ago

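The error "TypeError: Cannot read properties of undefined (reading 'text')" means the code tried to read a text property from a value that is undefined. The stack trace places the failure in the Pinecone client's error-handling code (extractMessage in errors\utils.js), so a failed Pinecone request may be the underlying cause rather than the documents themselves. Inspecting the documents is still a reasonable first step. Here's how you can troubleshoot this issue:

Steps to Troubleshoot

  1. Check Document Structure: Ensure that the documents loaded by directoryLoader.load() have a non-empty pageContent property. LangChain documents store their text in pageContent, not in text.

  2. Verify Split Documents: Ensure that the documents returned by textSplitter.splitDocuments(rawDocs) also have a non-empty pageContent property.

  3. Add Logging: Modify your code to log the structure of the documents at each step to inspect them.

Here's the modified code with additional logging: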

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

const filePath = 'docs';

export const run = async () => {
  try {
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new PDFLoader(path),
    });

    const rawDocs = await directoryLoader.load();
    console.log('rawDocs', rawDocs); // Log rawDocs to inspect structure

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs); // Log split docs to inspect structure

    console.log('creating vector store...');
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME);

    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();
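
Reading the full console dump by eye can be tedious. As a minimal sketch, assuming each document carries its text in pageContent (the shape LangChain's Document uses), a small helper could report which chunks are empty. The DocLike interface and the findEmptyDocs name are hypothetical and exist only for this illustration; they are not part of the repository or of LangChain.

```typescript
// Hypothetical minimal shape for illustration; it mirrors the two fields
// (pageContent, metadata) that appear in the logged Document objects.
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Returns the indexes of documents whose pageContent is missing or blank.
function findEmptyDocs(docs: DocLike[]): number[] {
  const emptyIndexes: number[] = [];
  docs.forEach((doc, i) => {
    if (typeof doc.pageContent !== 'string' || doc.pageContent.trim() === '') {
      emptyIndexes.push(i);
    }
  });
  return emptyIndexes;
}

// Example data (made up): the second chunk contains only whitespace.
const sample: DocLike[] = [
  { pageContent: 'chunk of text', metadata: { source: 'docs/testcase.pdf' } },
  { pageContent: '   ', metadata: { source: 'docs/testcase.pdf' } },
];

console.log('empty docs at indexes:', findEmptyDocs(sample));
```

In the ingest script, a call such as findEmptyDocs(docs) right after splitDocuments would give a short answer instead of a long log.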

Additional Checks

Ensure that your .env file sets a value for each of the following variables:

OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_ENVIRONMENT=
PINECONE_INDEX_NAME=
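
A blank or missing variable is easy to overlook. The following sketch, under the assumption that these four names are the ones the project reads, lists any that are unset or empty. The missingEnvVars helper and the REQUIRED_VARS constant are hypothetical names introduced for this example.

```typescript
// Variable names taken from the .env template above.
const REQUIRED_VARS: string[] = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_ENVIRONMENT',
  'PINECONE_INDEX_NAME',
];

// Returns the names of required variables that are undefined or blank.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_VARS,
): string[] {
  return required.filter((name) => (env[name] ?? '').trim() === '');
}

// Example environment (made-up values): one key is blank, one is absent.
const exampleEnv: Record<string, string | undefined> = {
  OPENAI_API_KEY: 'sk-example',
  PINECONE_API_KEY: '',
  PINECONE_INDEX_NAME: 'pdf',
};

console.log('missing:', missingEnvVars(exampleEnv));
```

In a Node script, passing process.env instead of exampleEnv would check the real environment before any request is sent.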

By following these steps, you should be able to identify and resolve the issue. If you continue to encounter problems, please provide the output of the logged structures for further assistance.

**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

fraction94 commented 1 month ago

Apologies for being so dumb, but which part of the program should I use this code in?

dosubot[bot] commented 1 month ago

You should place the provided code in the scripts/ingest-data.ts file. This file is responsible for ingesting data into Pinecone. Here is the complete code for scripts/ingest-data.ts with the added logging for inspection:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

const filePath = 'docs';

export const run = async () => {
  try {
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new PDFLoader(path),
    });

    const rawDocs = await directoryLoader.load();
    console.log('rawDocs', rawDocs); // Log rawDocs to inspect structure

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs); // Log split docs to inspect structure

    console.log('creating vector store...');
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME);

    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();

Explanation of the Code

  1. DirectoryLoader: Loads all PDF files from the specified directory (docs).
  2. Logging rawDocs: Logs the raw documents loaded from the directory to inspect their structure.
  3. RecursiveCharacterTextSplitter: Splits the documents into smaller chunks.
  4. Logging docs: Logs the split documents to inspect their structure.
  5. Creating Vector Store: Creates a vector store using the Pinecone index and the embeddings generated from the documents.
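
To make the chunkSize and chunkOverlap settings concrete, here is a simplified sketch of overlapping chunking. It is a plain fixed-width sliding window, not the algorithm RecursiveCharacterTextSplitter actually uses (which prefers to break on separators such as paragraphs and sentences), and the chunkWithOverlap name is hypothetical.

```typescript
// Splits text into windows of chunkSize characters, where each window
// starts (chunkSize - chunkOverlap) characters after the previous one.
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('require 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// With the settings from the script (1000 / 200), a 2500-character text
// yields windows starting at offsets 0, 800 and 1600.
const chunks = chunkWithOverlap('a'.repeat(2500), 1000, 200);
console.log(chunks.map((c) => c.length)); // [ 1000, 1000, 900 ]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, which helps keep sentences that straddle a boundary retrievable.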

Running the Script

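Run the script through the project's ingest command rather than calling node on the .ts file directly. In your terminal, use:

yarn run ingest

If you use npm instead of yarn, the equivalent is npm run ingest.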

Ensure you have all the necessary environment variables set up in your .env file, especially PINECONE_INDEX_NAME and PINECONE_API_KEY.

Troubleshooting

If you encounter the error TypeError: Cannot read properties of undefined (reading 'text'), first confirm that the documents have the correct structure after being split by the RecursiveCharacterTextSplitter. LangChain documents store their text in pageContent, so the expected structure of the docs object is:

[
  {
    "pageContent": "chunk of text",
    "metadata": {
      "source": "source file path or identifier"
    }
  },
  ...
]

If the structure is correct and the error persists, double-check the configuration of the Pinecone index. Note that textKey does not refer to a property of your documents: it names the metadata field under which PineconeStore saves each chunk's pageContent in Pinecone.
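
As a rough illustration of the role textKey plays, the sketch below shows how a chunk's text could be copied into a vector record's metadata under that key before it is sent to Pinecone. This is an assumption-laden simplification, not LangChain's actual implementation; DocLike, VectorRecord, and toVectorRecord are hypothetical names, and the embedding values are made up.

```typescript
// Hypothetical shapes for illustration only.
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, unknown>;
}

// Builds a record whose metadata keeps the original fields and adds the
// chunk text under the configured textKey ('text' by default here).
function toVectorRecord(
  id: string,
  doc: DocLike,
  embedding: number[],
  textKey: string = 'text',
): VectorRecord {
  return {
    id,
    values: embedding,
    metadata: { ...doc.metadata, [textKey]: doc.pageContent },
  };
}

const record = toVectorRecord(
  'chunk-0',
  { pageContent: 'chunk of text', metadata: { source: 'docs/testcase.pdf' } },
  [0.1, 0.2, 0.3], // placeholder embedding, not a real model output
);

console.log(record.metadata);
```

Under this reading, the documents themselves never need a text property; the key only appears on the stored record.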


fraction94 commented 1 month ago

I have tried it, but it is still not working. I get the following error. My PDF is a very simple document of a few pages which I converted from Word to PDF, so it should have the text property. It looks like OpenAI works fine but Pinecone does not. Any help is greatly appreciated.

Document {
  pageContent: '8 \n' +
    ' \n' +
    'diverse cultures, Rocky Mountains, Niagara Falls, hospitality and Canadian cities. The greatest \n' +
    'Canadians that you should know include; Wayne Gretzky. Tommy Douglas, Dr. Roberta Bondar, \n' +
    'Pierre Trudeau, and Terrance Stanley Fox. The five common Canadian musicians include \n' +
    'Leonard Cohen, Celine Dion, The Tragically Hip (Gord Downie as lead singer), Joni Mitchell \n' +
    'and Shania Twain. Canada has had great inventions which have been impacts to the world the \n' +
    'inventors are Alexander Graham Bell (telephone), Mathew Evans and Henry Woodward (first \n' +
    'electric bulb), Sir Sandford Fleming (standard time), James Naismith (basketball), and Arthur \n' +
    'Sicard (snowblower).',
  metadata: {
    source: 'C:\Python\gpt4-pdf\docs\testcase.pdf',
    pdf: [Object],
    loc: [Object]
  }
}
]
creating vector store...
error TypeError: Cannot read properties of undefined (reading 'text')
    at C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:44:57
    at step (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:33:23)
    at Object.next (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:14:53)
    at C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:8:71
    at new Promise ()
    at __awaiter (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:4:12)
    at extractMessage (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\utils.js:40:48)
    at C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\handling.js:66:70
    at step (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\handling.js:33:23)
    at Object.next (C:\Python\gpt4-pdf\node_modules\@pinecone-database\pinecone\dist\errors\handling.js:14:53)

file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:39
    throw new Error('Failed to ingest your data');
    ^
Error: Failed to ingest your data
    at run (file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:39:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at file:///C:/Python/gpt4-pdf/scripts/ingest-data.ts:44:3
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Also, this is how I set up my .env file:

OPENAI_API_KEY=sk-proj-zxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
PINECONE_API_KEY=4d8dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
PINECONE_ENVIRONMENT=us-east-1
PINECONE_INDEX_NAME=pdf

dosubot[bot] commented 1 month ago

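The error "TypeError: Cannot read properties of undefined (reading 'text')" occurs when code tries to access the 'text' property of a value that is undefined. Your log shows a well-formed document (pageContent and metadata are both present), and the stack trace points into the Pinecone client's error handling (extractMessage), so the failure may lie in the request to Pinecone rather than in the documents. The steps below help rule out each possibility.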
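
For readers new to this class of error, the sketch below reproduces it in isolation: reading a property from undefined throws a TypeError, while optional chaining returns a fallback instead. ResponseLike, readText, and readTextSafely are hypothetical names for this illustration and do not correspond to the Pinecone client's real types or functions.

```typescript
// Hypothetical stand-in for any object expected to carry a text field.
interface ResponseLike {
  text: string;
}

// Mimics plain JavaScript that assumes the value is always defined.
// The cast silences the type checker, so the failure happens at runtime.
function readText(response: ResponseLike | undefined): string {
  return (response as ResponseLike).text;
}

// Defensive variant: optional chaining plus a fallback value.
function readTextSafely(response: ResponseLike | undefined): string {
  return response?.text ?? '(no response body)';
}

let caught: unknown;
try {
  readText(undefined);
} catch (err) {
  caught = err;
}

console.log(caught instanceof TypeError); // true
console.log(readTextSafely(undefined));
```

The practical point is that the message names the property being read, not the thing that is broken: the undefined value one step earlier is what needs investigating.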

Steps to Resolve the Issue

  1. Verify Document Structure: LangChain documents store their text in pageContent, so ensure that each document has a non-empty pageContent property before passing it to Pinecone. Add the following logging to your ingest-data.ts script to inspect the structure of the documents after they are split:

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs);

    // Check that every document has non-empty pageContent
    docs.forEach((doc, index) => {
      if (!doc.pageContent) {
        console.error(`Document at index ${index} is missing pageContent`, doc);
      }
    });
  2. Ensure Correct Pinecone Configuration: Make sure your .env file contains the Pinecone settings, including a PINECONE_INDEX_NAME that matches the name of the index you created in Pinecone:

    PINECONE_INDEX_NAME=pdf
  3. Check CustomPDFLoader Implementation: If your script uses the repository's CustomPDFLoader (utils/customPDFLoader.ts) rather than LangChain's PDFLoader, verify that it sets the pageContent property of each Document to parsed.text, which should contain the text content of the PDF.

    // Excerpt; BufferLoader and PDFLoaderImports are assumed to be
    // defined alongside this class in utils/customPDFLoader.ts.
    export class CustomPDFLoader extends BufferLoader {
      public async parse(
        raw: Buffer,
        metadata: Document['metadata'],
      ): Promise<Document[]> {
        const { pdf } = await PDFLoaderImports();
        const parsed = await pdf(raw);
        return [
          new Document({
            pageContent: parsed.text, // full text extracted from the PDF
            metadata: {
              ...metadata,
              pdf_numpages: parsed.numpages,
            },
          }),
        ];
      }
    }
  4. Debugging Pinecone Ingestion: Add more detailed logging around the Pinecone ingestion process to identify any issues:

    try {
      const docs = await textSplitter.splitDocuments(rawDocs);
      console.log('split docs', docs);

      // Check that every document has non-empty pageContent
      docs.forEach((doc, index) => {
        if (!doc.pageContent) {
          console.error(`Document at index ${index} is missing pageContent`, doc);
        }
      });

      // Ingest documents into Pinecone (same call as in the full script:
      // fromDocuments takes the docs, an embeddings instance, and the store config)
      const embeddings = new OpenAIEmbeddings();
      const index = pinecone.Index(PINECONE_INDEX_NAME);
      await PineconeStore.fromDocuments(docs, embeddings, {
        pineconeIndex: index,
        namespace: PINECONE_NAME_SPACE,
        textKey: 'text',
      });
      console.log('Successfully ingested documents into Pinecone');
    } catch (error) {
      console.error('Failed to ingest documents into Pinecone:', error);
      throw new Error('Failed to ingest your data');
    }

By following these steps, you should be able to identify and resolve the issue causing the "TypeError: Cannot read properties of undefined (reading 'text')" error.
