mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4
14.73k stars · 3k forks

Failed to ingest your data #466

Open MuhammadIshaq-AI opened 1 month ago

MuhammadIshaq-AI commented 1 month ago

I am trying to ingest some PDF data using the `ingest.ts` code below:

```typescript
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { CustomPDFLoader } from '@/utils/customPDFLoader';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

/* Name of directory to retrieve your files from */
const filePath = 'new docs';

export const run = async () => {
  try {
    /* Load raw docs from all files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new CustomPDFLoader(path),
    });

    const rawDocs = await directoryLoader.load();

    // Extracting the file name using regular expressions and updating metadata
    const processedDocs = rawDocs.map((doc) => {
      const fileName = doc.metadata.source.match(/[^\\\/]+$/)?.[0] || doc.metadata.source;
      const modifiedMetadata = { ...doc.metadata, source: fileName };
      return { ...doc, metadata: modifiedMetadata };
    });

    /* Split text into chunks */
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(processedDocs);
    console.log('split docs', docs);

    console.log('creating vector store...');
    /* Create and store the embeddings in the vectorStore */
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.Index(PINECONE_INDEX_NAME); // Change to your own index name

    // Embed the PDF documents
    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();
```

I get this error when I run `npm run ingest`:

```
creating vector store...
error [ErrorWithoutStackTrace: PineconeClient: Error calling upsert: ErrorWithoutStackTrace: PineconeClient: Error calling upsertRaw: FetchError: The request failed and the interceptors did not return an alternative response]
c:\Users\Zara\Desktop\devnewgit\chatbot\scripts\ingest-data.ts:51
    throw new Error('Failed to ingest your data');
          ^

Error: Failed to ingest your data
    at run (c:\Users\Zara\Desktop\devnewgit\chatbot\scripts\ingest-data.ts:51:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at (c:\Users\Zara\Desktop\devnewgit\chatbot\scripts\ingest-data.ts:56:3)

Node.js v18.16.0
```

dosubot[bot] commented 1 month ago

The error you're encountering during the upsert operation with the PineconeClient points to a few likely causes: Pinecone configuration mistakes, network problems, data that doesn't meet Pinecone's requirements, or a Pinecone service outage. Here's a focused way to troubleshoot and resolve the error:

  1. Verify Pinecone Configuration: Ensure that your Pinecone API key, index name, and any other configuration details are correct. Mistakes in these settings are common sources of errors.

  2. Check Network Connectivity: Network issues can prevent successful communication with Pinecone's servers. Ensure your network connection is stable and check for any firewall or VPN that might block the connection.

  3. Review Data Being Upserted: The format and size of the data you're trying to upsert should meet Pinecone's requirements. Oversized data or incorrect formats can cause failures.

  4. Pinecone Service Status: Occasionally, the problem might be on Pinecone's end. Check their status page for any service disruptions or maintenance activities.
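For step 1, it helps to fail fast on missing configuration before any network call is made. Here is a minimal sketch: the environment variable names follow this repo's `.env` setup, and `missingEnvVars` is a hypothetical helper for illustration, not part of the repo.

```typescript
/* Sketch: validate required configuration before ingesting.
   Env var names assume this repo's .env conventions; adjust to your setup. */

const REQUIRED_VARS = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_ENVIRONMENT',
  'PINECONE_INDEX_NAME',
];

/** Return the names from `required` that are unset or empty in `env`. */
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => !env[name]);
}

// Example use at the top of run():
// const missing = missingEnvVars(process.env, REQUIRED_VARS);
// if (missing.length > 0) {
//   throw new Error(`Missing environment variables: ${missing.join(', ')}`);
// }
```

An empty string counts as missing here, which catches `.env` lines that were left blank.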

If the issue persists after these steps, review the specific configuration of your Pinecone client and the data you're attempting to upsert for more clues. Exploring discussions and issues in the repository about similar errors may also provide insights specific to your scenario.
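On step 3 specifically: one linked issue reports Pinecone rejecting upsert requests over 4194304 bytes, and sending all chunks in a single `fromDocuments` call can hit that ceiling. A hedged sketch of batching the upsert, assuming the same `docs`, `embeddings`, `index`, and `PINECONE_NAME_SPACE` as in your script (`chunk` and `BATCH_SIZE` are illustrative names, not from the repo):

```typescript
/** Split an array into consecutive batches of at most `size` items. */
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// In run(), replace the single fromDocuments call with a loop so each
// request stays well under the size limit (BATCH_SIZE is a tuning knob):
// const BATCH_SIZE = 50;
// for (const batch of chunk(docs, BATCH_SIZE)) {
//   await PineconeStore.fromDocuments(batch, embeddings, {
//     pineconeIndex: index,
//     namespace: PINECONE_NAME_SPACE,
//     textKey: 'text',
//   });
// }
```

Smaller batches also make transient `FetchError`s cheaper to retry, since a failure only loses one batch rather than the whole ingest.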

#### Details

**Sources**
- [Error: Failed to Ingest](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/discussions/6)
- [Ingestion 429 Too Many Requests](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/discussions/196)
- [Ingest Error [Error: PineconeClient: Error calling upsert: TypeError]](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/discussions/106)
- [Client network socket disconnected before secure TLS connection was established](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/422)
- [Error: Failed to ingest your data](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/318)
- [Error: Failed to ingest your data](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/108)
- [PineconeError: Error, message length too large: found 5453452 bytes, the limit is: 4194304 bytes](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/440)
- [error [ErrorWithoutStackTrace: PineconeClient: Error calling upsert: ErrorWithoutStackTrace: undefined]](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/207)
- [utils/pinecone-client.ts](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/blob/main/utils/pinecone-client.ts)
- [config/pinecone.ts](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/blob/main/config/pinecone.ts)
- [pages/api/chat.ts](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/blob/main/pages/api/chat.ts)
- [scripts/ingest-data.ts](https://github.com/mayooear/gpt4-pdf-chatbot-langchain/blob/main/scripts/ingest-data.ts)

**About Dosu**
This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.