mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

I get this error when I open my local server: Cannot read properties of undefined (reading 'text') #461

Closed Darrenf040 closed 4 months ago

Darrenf040 commented 9 months ago

I set up everything as the documentation describes, but when I run 'npm run dev' to start my local server, the UI gives me this error when I try to chat with the bot: Cannot read properties of undefined (reading 'text').

dosubot[bot] commented 9 months ago

🤖

Hey @Darrenf040, great to see you back! Hope you're doing well.

The error "Cannot read properties of undefined (reading 'text')" typically occurs when the code is trying to access a property of an object that hasn't been defined. In the context of the gpt4-pdf-chatbot-langchain, this could be due to several reasons:

  1. The OpenAI or Pinecone API keys are not correctly set in the .env file. Make sure you have correctly copied your API keys into the .env file and that they are valid.

  2. The PINECONE_NAME_SPACE in the config folder might not have been correctly set. Ensure that you have replaced it with a namespace where you'd like to store your embeddings on Pinecone.

  3. The modelName in new OpenAI in utils/makechain.ts might not have been correctly set to gpt-4. Ensure that you have access to the gpt-4 API and that you have changed the modelName accordingly.

  4. The PDF files might not have been correctly converted to embeddings. Make sure you have added your PDF files in the docs folder and run the yarn run ingest script to 'ingest' and embed your docs. Also, check your Pinecone dashboard to verify that your namespace and vectors have been added.

  5. There might be an issue with the versions of LangChain and Pinecone you are using. Ensure that you are using the same versions as this repo.

  6. There might be an issue with your Node version. Make sure you are running the latest Node version.

If you have checked all these points and the error still persists, you might want to console.log your env variables to make sure they are exposed and hard code your API keys into the process.env variables. If the error still persists, you might want to retry from scratch with a new Pinecone project, index, and cloned repo. You can refer to the README.md for more details.
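
The suggestion to log the environment variables can be sketched as a small check that reports which variables are missing without printing the secret values. The helper name and the variable names below are assumptions for illustration; adjust them to match your own .env file.

```typescript
// Hypothetical helper: returns the names of required variables that are
// missing or blank in the given environment object.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}

// Assumed variable names; check them against your own .env file.
const REQUIRED_VARS = ['OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_INDEX_NAME'];

// Example with a hand-written environment object.
const missing = missingEnvVars(
  { OPENAI_API_KEY: 'sk-example', PINECONE_API_KEY: '' },
  REQUIRED_VARS,
);
console.log('Missing variables:', missing);
```

In the app itself you would pass process.env instead of the hand-written object.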

Regarding the structure of the object that the 'text' property is being read from, it's being read from the 'parsed' object. The 'parsed' object is the result of the 'pdf' function being called with 'raw' as an argument. The 'pdf' function is imported from the 'pdf-parse/lib/pdf-parse.js' module. The 'text' property contains the content of the PDF document and 'numpages' property contains the number of pages in the PDF document. Here is the relevant code:

const parsed = await pdf(raw);
return [
  new Document({
    pageContent: parsed.text,
    metadata: {
      ...metadata,
      pdf_numpages: parsed.numpages,
    },
  }),
];

You can find this in the utils/customPDFLoader.ts file.
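
For illustration, the error message itself comes from reading .text on a value that is undefined. A minimal sketch of a guard that fails with a more descriptive message is shown here; the ParsedPdf type and the requireText helper are hypothetical names, not part of the repo.

```typescript
// Hypothetical shape of the parser result; only the fields used above.
interface ParsedPdf {
  text: string;
  numpages: number;
}

// Returns the text, or throws a descriptive error instead of the generic
// "Cannot read properties of undefined (reading 'text')".
function requireText(parsed: ParsedPdf | undefined, source: string): string {
  if (parsed === undefined) {
    throw new Error(`PDF parser returned nothing for ${source}`);
  }
  return parsed.text;
}

console.log(requireText({ text: 'hello', numpages: 1 }, 'example.pdf'));
```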

I also found some similar issues in the repository that might help you:

  1. Error when trying to transition from GPT-4 to PaLM
  2. "TypeError: Cannot read properties of undefined (reading 'text')"
  3. Missing value for input

Please check these issues and see if the solutions provided there can help you resolve your problem.

Darrenf040 commented 9 months ago

How do I check the Pinecone and LangChain versions?

Darrenf040 commented 9 months ago

Also, my model name is 'gpt-3.5-turbo', since I don't have GPT-4 access.

scenaristeur commented 8 months ago

Hi @Darrenf040, I have the same error. Look at the end of the error message:

/gpt4-pdf-chatbot-langchain/scripts/ingest-data.ts:52
    throw new Error('Failed to ingest your data');

For me it says to look at line 52 of ingest-data.ts. This can help you find which data is missing. In my case it was an issue with the Pinecone namespace or index:

    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'text',
    });
richard523 commented 8 months ago

I believe there are now type errors when ingesting using newest Pinecone types.

I think they want you to now convert the embeddings into vectors and upsert in the new way?

But again, this has conflicts with the "makechain" script.

Here's the new upsert Pinecone wants you to use: "https://docs.pinecone.io/docs/upsert-data"

LMK if you made any progress on cleaning up the types. I'm also incredibly stuck!

ScottBlinman commented 8 months ago

Make sure you are using a pod-based Pinecone index. The serverless index doesn't work.

mowliv commented 8 months ago

I'm stuck on this as well. I have a forked repo with an extended feature set at https://github.com/anandaworldwide/ananda-library-chatbot and it is failing with the error "Cannot read properties of undefined (reading 'text')."

I tried upgrading to langchain 0.1.30 but that didn't help and caused other issues from breaking changes.

What is the "text" textKey parameter here? ChatGPT suggested changing it to pageContent, which is a field in my document data, but on smaller datasets it is finding the content using "text". I haven't located API docs to explain it.

await PineconeStore.fromDocuments(docs, embeddings, {
  pineconeIndex: index,
  textKey: 'text',
});
richard-aoede commented 8 months ago

textKey is the actual text being stored as metadata within Pinecone, I believe. Edit: It's the key of the text once it's stored in the database as metadata.
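
A simplified illustration of that idea, not LangChain's actual implementation: the document's pageContent is written into the record's metadata under whatever key textKey names, and read back from the same key at query time. The type and function names here are hypothetical.

```typescript
type Metadata = Record<string, unknown>;

interface Doc {
  pageContent: string;
  metadata: Metadata;
}

// Build the metadata stored alongside a vector: the original metadata plus
// the page text under the configured key.
function toStoredMetadata(doc: Doc, textKey: string): Metadata {
  return { ...doc.metadata, [textKey]: doc.pageContent };
}

// Rebuild a document from stored metadata by reading the same key back.
function fromStoredMetadata(stored: Metadata, textKey: string): Doc {
  const { [textKey]: text, ...metadata } = stored;
  return { pageContent: String(text ?? ''), metadata };
}

const stored = toStoredMetadata(
  { pageContent: 'Chapter 1 text', metadata: { source: 'book.pdf' } },
  'text',
);
console.log(stored);
console.log(fromStoredMetadata(stored, 'text'));
```

In this sketch, reading back with a different key than the one used for writing returns an empty pageContent, which is one reason to keep the key the same at ingest and query time.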

I've given up on serverless Pinecone with this project because there are unresolved type errors between the Pinecone serverless docs and LangChain.js that I cannot figure out.

From what I've read about this project, pod-based storage is the way to go since serverless Pinecone is still experimental.

richard-aoede commented 8 months ago

I did successfully upsert pdfs in serverless using typescript but when I tried to search it threw errors with the makeChain function for me.

mowliv commented 8 months ago

Yes I'm using the free pod-based. Thanks for textKey info!


mowliv commented 8 months ago

I just figured out how to reproduce the error and how to fix it. I noticed that sometimes when I get the error, I also get a JavaScript heap out of memory error as a secondary exception. In the test I just did, however, I only got the primary error. But expanding the JavaScript memory allocation solves it!

The change to the "ingest" line in package.json that fixes it for me: "ingest": "NODE_OPTIONS='--max-old-space-size=4096' tsx -r dotenv/config scripts/ingest-data.ts"
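
To check whether the larger heap limit is actually picked up, one option is to print V8's heap size limit from inside the script. This sketch uses Node's built-in v8 module; the helper name heapLimitMb is made up for illustration.

```typescript
import * as v8 from 'node:v8';

// Returns V8's current heap size limit in megabytes.
function heapLimitMb(): number {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`Heap limit: ${heapLimitMb()} MB`);
```

With --max-old-space-size=4096 the reported value should be in the neighborhood of 4096 MB; a much smaller number suggests NODE_OPTIONS is not reaching the process.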

Failure: Failed to embed documents or store in Pinecone: TypeError: Cannot read properties of undefined (reading 'text')
    at <anonymous> (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:44:57)
    at step (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:33:23)
    at Object.next (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:14:53)
    at <anonymous> (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:8:71)
    at new Promise (<anonymous>)
    at __awaiter (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:4:12)
    at extractMessage (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:40:48)
    at <anonymous> (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:66:70)
    at step (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:33:23)
    at Object.next (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:14:53)

And here's a failure that includes the JavaScript heap out of memory error:

Failed to embed documents or store in Pinecone: TypeError: Cannot read properties of undefined (reading 'text')
    at <anonymous> (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:44:57)
    at step (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:33:23)
    at Object.next (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:14:53)
    at <anonymous> (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:8:71)
    at new Promise (<anonymous>)
    at __awaiter (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:4:12)
    at extractMessage (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/utils.js:40:48)
    at <anonymous> (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:66:70)
    at step (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:33:23)
    at Object.next (/Users/Michael/Documents/development/gpt4-pdf-chatbot-langchain-ananda-lib/node_modules/@pinecone-database/pinecone/dist/errors/handling.js:14:53)

<--- Last few GCs --->

[95569:0x7fb878008000]   614381 ms: Mark-Compact (reduce) 4065.8 (4122.3) -> 4065.8 (4122.3) MB, 75.23 / 0.00 ms  (average mu = 0.650, current mu = 0.302) allocation failure; scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----

 1: 0x102a931d2 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [/usr/local/Cellar/node/21.4.0/bin/node]
 2: 0x102c1dbfd v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/usr/local/Cellar/node/21.4.0/bin/node]
 3: 0x102c1db93 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/usr/local/Cellar/node/21.4.0/bin/node]
 4: 0x102db6a65 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/Cellar/node/21.4.0/bin/node]
 5: 0x102db59d4 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/Cellar/node/21.4.0/bin/node]
 6: 0x102dad49f v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/Cellar/node/21.4.0/bin/node]
 7: 0x102dadc95 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/Cellar/node/21.4.0/bin/node]
 8: 0x102d95ab2 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/usr/local/Cellar/node/21.4.0/bin/node]
 9: 0x102d8ce42 v8::internal::MaybeHandle<v8::internal::SeqOneByteString> v8::internal::FactoryBase<v8::internal::Factory>::NewRawStringWithMap<v8::internal::SeqOneByteString>(int, v8::internal::Tagged<v8::internal::Map>, v8::internal::AllocationType) [/usr/local/Cellar/node/21.4.0/bin/node]
10: 0x102d8cda7 v8::internal::FactoryBase<v8::internal::Factory>::NewStringFromOneByte(v8::base::Vector<unsigned char const>, v8::internal::AllocationType) [/usr/local/Cellar/node/21.4.0/bin/node]
11: 0x102e91a13 v8::internal::JsonStringifier::Stringify(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [/usr/local/Cellar/node/21.4.0/bin/node]
12: 0x102e91893 v8::internal::JsonStringify(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [/usr/local/Cellar/node/21.4.0/bin/node]
13: 0x102c89128 v8::internal::Builtin_JsonStringify(int, unsigned long*, v8::internal::Isolate*) [/usr/local/Cellar/node/21.4.0/bin/node]
14: 0x102883c76 Builtins_CEntry_Return1_ArgvOnStack_BuiltinExit [/usr/local/Cellar/node/21.4.0/bin/node]
✨  Done in 614.90s.
mowliv commented 8 months ago

I found the above sometimes still failed. So I upgraded langchain to 0.1.30 (and had to adapt the code a bit). But that didn't do it, so I upgraded @pinecone-database/pinecone to 1.1.3, and now it seems to work. It was never a problem when I processed only 4000 PDF files. The problem only came up when I processed my full set of 6000. So I'm guessing there was a memory leak in Pinecone that got resolved in a later version. (Though I'm still verifying things... the vector count is half of what it was before, so perhaps I'm not processing as much as I think.)

richard-aoede commented 7 months ago

@mowliv That's awesome that you got through a whopping 6000 PDFs! I've got 52 PDFs ranging from 50 to 300 pages each. I'm getting really low scores when trying to search. In the examples I saw, they get scores of up to 0.80. I'm only getting less than 0.10 on my best-scoring contexts. Please let me know if you have any more updates! I'll try to share mine on this thread or we can continue through DMs.

richard-aoede commented 7 months ago

Oh and here's my snippet but it's pretty messy lol. Note that I borrowed someone else's idea of having multiple pdfs from a flat directory (ex: "farm_animals/chickens.pdf", "farm_animals/cows.pdf", "home_animals/dogs.pdf", "home_animals/cats.pdf"). USING SERVERLESS pinecone.

import fs from 'fs';
import path from 'path';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { pc } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { CohereEmbeddings } from "@langchain/cohere"; // 1000 API request rate limit reached
// import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { PINECONE_INDEX_NAME } from '@/config/pinecone';
import { Pinecone } from '@pinecone-database/pinecone';

export const run = async () => {
  try {
    const index = pc.Index(PINECONE_INDEX_NAME); 
    const indexDescription = await pc.describeIndex(PINECONE_INDEX_NAME);

    /* Load all directories */
    const directories = fs
      .readdirSync('./docs')
      .filter((file) => {
        return fs.statSync(path.join('./docs', file)).isDirectory();
      })
      .map((dir) => `./docs/${dir}`); // Add prefix 'docs/' to directory names
    console.log('directories: ', directories);
    for (const directory of directories) {
      /* Load all PDF files in the directory */
      const files = fs
        .readdirSync(directory)
        .filter((file) => path.extname(file) === '.pdf');

      for (const file of files) {
        // console.log(`Processing file: ${file}`);

        /* Load raw docs from the pdf file */
        const filePath = path.join(directory, file);
        const loader = new PDFLoader(filePath);
        const rawDocs = await loader.load();

        // console.log(rawDocs[0].pageContent);
        // console.log(rawText);

        /* Split text into chunks */
        const textSplitter = new RecursiveCharacterTextSplitter({
          chunkSize: 1000,
          chunkOverlap: 200,
        });

        const docs = await textSplitter.splitDocuments(rawDocs);
        // const rawFlattenedDoc = rawDocs.map((doc) => doc.toString());
        // console.log('split docs', docs);
        // console.log(rawDocs);

        console.log('creating vector store...');
        /*create and store the embeddings in the vectorStore*/
        const embeddings = new CohereEmbeddings({
          apiKey: process.env.COHERE_API_KEY});

        // Some helpful logs  

        // console.log("PRINT INDEX: ", index);
        // const indexDescription = await pinecone.describeIndex(PINECONE_INDEX_NAME);
        // console.log('pinecone index description', indexDescription);

        const namespace = path.basename(directory); // use the directory name as the namespace 

        //embed the PDF documents

        /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
        const chunkSize = 50;

        // Now let's chunk the docs inside the PDF!
        for (let i = 0; i < docs.length; i += chunkSize) {
          const chunk = docs.slice(i, i + chunkSize);
          // console.log('chunk', i, chunk);

          // loads data as array of strings
          const stringArray: string[] = chunk.map((doc) => JSON.stringify(doc.pageContent));
          // console.log(stringArray)
          // doc is loaded as array of embeddings
          const embeddingsArray = await embeddings.embedDocuments(stringArray);

          // now, loop through Cohere's embeddings array.
          // each element of embeddingsArray is one embedding: a list of 1024 numbers
          // embeddingsArray = [ [0.124, -0.242, ... 1022 more numbers ], [0.124, -0.242, ... 1022 more numbers ]]
          // (can swap with other embedding arrays)
          for (let j = 0; j < embeddingsArray.length; j++) {
            // id of the embedding
            const timestamp = Date.now();
            const randomValue = Math.floor(Math.random() * 1000);
            const namespaceID = namespace.concat("_", file, "_", `${timestamp}-${randomValue}`);
            console.log(namespaceID); 
            // from array of arrays to array of size 1024
            const flattenedArray: number[] = embeddingsArray[j]; 
            console.log(flattenedArray)
            // raw string of the flattenedArray (embeddings)
            console.log(stringArray[j]) 

            // required for upsert of namespace
            // ex: ai-ml, philosophy, etc etc
            const namespaceUpsert = index.namespace(namespace);

            await namespaceUpsert.upsert([
              {
                "id": namespaceID,
                "values": flattenedArray,
                "metadata": {
                  "text": stringArray[j],
                  "chunkSize": chunkSize,
                  "topic": directory,
                  "source": file,
                  "chunk_i": i,
                  "chunk_j": j}
              },
            ]);

          }
          // UNCOMMENT 2nd for loop here

          // OLD AND BROKEN?
          // await PineconeStore.fromDocuments(
          //   chunk,
          //   embeddings,
          //   {
          //     pineconeIndex: index,
          //     maxConcurrency: 5, // Maximum number of batch requests to allow at once. Each batch is 1000 vectors.
          //   }
          // );
          // await index.namespace('pdf-chat-serverless').upsert(docs);

        }

        console.log(`File ${file} processed`);

      }

    }

  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('completed ingestion of all PDF files in all directories');
})();
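
The snippet above sends one record per upsert call even though the docs are already grouped into chunks of 50. A possible variant collects the records first and sends them in batches. This sketch shows only the batching helper; the VectorRecord type, the toBatches name, and the batch size are assumptions, and the actual upsert call is left as a comment.

```typescript
// Assumed record shape, mirroring the object passed to upsert above.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string | number>;
}

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 120 dummy records split into batches of 50.
const records: VectorRecord[] = Array.from({ length: 120 }, (_, n) => ({
  id: `example-${n}`,
  values: [0, 0, 0],
  metadata: { chunk: n },
}));

for (const batch of toBatches(records, 50)) {
  // In the ingest script this is where something like
  // `await index.namespace(namespace).upsert(batch)` would go.
  console.log(`batch of ${batch.length} records`);
}
```

Fewer, larger requests may also reduce the number of round trips during a long ingest, though whether that helps with the memory issue discussed earlier in the thread is untested here.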