mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

Error: Failed to ingest your data #318

Closed umerarif01 closed 1 year ago

umerarif01 commented 1 year ago

Hi! I am getting this error. I have set up all of my env variables correctly. I don't know why it is not working.

PS C:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain> npm run ingest

> gpt4-langchain-pdf-chatbot@0.1.0 ingest    
> tsx -r dotenv/config scripts/ingest-data.ts

[WARN] Importing from 'langchain/document_loaders' is deprecated. Import from eg. 'langchain/document_loaders/fs/text' or 'langchain/document_loaders/web/cheerio' instead. See https://js.langchain.com/docs/getting-started/install#updating-from-0052 for upgrade instructions.
error TypeError: Object.hasOwn is not a function
    at null.DirectoryLoader (c:/Users/UMER%20ARIF/Desktop/Projects/gpt4-pdf-chatbot-langchain/node_modules/langchain/dist/document_loaders/fs/directory.js:41:24)
    at null.run (c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:15:29)
    at null.<anonymous> (c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:49:9)
    at null.<anonymous> (c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:51:1)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:44
    throw new Error('Failed to ingest your data');
          ^

Error: Failed to ingest your data
    at null.run (c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:44:11)
    at null.<anonymous> (c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:49:9)
    at null.<anonymous> (c:\Users\UMER ARIF\Desktop\Projects\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:51:1)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
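
A note on the first error in the trace: `Object.hasOwn` is a fairly recent JavaScript built-in, available in Node.js from version 16.9.0 onward. A `TypeError: Object.hasOwn is not a function` therefore suggests the script ran on an older Node.js release. That is an inference from the message, not something confirmed in this thread. The sketch below shows two ways to check for it; the helper names `supportsObjectHasOwn` and `hasObjectHasOwn` are hypothetical and not part of the repository.

```typescript
// Object.hasOwn is available from Node.js 16.9.0 onward.
const MIN_NODE: readonly [number, number, number] = [16, 9, 0];

// Hypothetical helper: true if a version string such as "v16.13.0"
// is at least MIN_NODE.
export function supportsObjectHasOwn(version: string): boolean {
  const parts = version
    .replace(/^v/, "")
    .split(".")
    .map((p) => parseInt(p, 10) || 0);
  for (let i = 0; i < MIN_NODE.length; i++) {
    const have = parts[i] ?? 0;
    if (have !== MIN_NODE[i]) return have > MIN_NODE[i];
  }
  return true; // exactly the minimum version
}

// Hypothetical helper: feature detection, independent of version numbers.
export function hasObjectHasOwn(): boolean {
  return (
    typeof (Object as unknown as { hasOwn?: unknown }).hasOwn === "function"
  );
}
```

Running `node -v` in the same terminal shows the version in use. Passing `process.version` to `supportsObjectHasOwn` at the top of the ingest script would fail fast with a clearer message than the generic "Failed to ingest your data".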

khalidfarooq commented 1 year ago

facing the same issue

umerarif01 commented 1 year ago

> facing the same issue

Let me know bro if you find a solution.

nexty5870 commented 1 year ago

Make sure your .env is set up right.

PINECONE_INDEX_NAME is the name of your index. (I got it confused at first and ran into the same issue you had.) I changed it and it worked.
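
A minimal sketch of a pre-flight check for the `.env` values, assuming the script reads its settings from environment variables. Only `PINECONE_INDEX_NAME` is named in this thread; the other variable names in `REQUIRED_VARS` and the helper `missingEnvVars` are assumptions for illustration.

```typescript
// Returns the names of required variables that are missing or blank.
export function missingEnvVars(
  env: Record<string, string | undefined>,
  required: readonly string[],
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// PINECONE_INDEX_NAME comes from the comment above;
// the other names are assumed, so adjust them to match your .env file.
export const REQUIRED_VARS = [
  "OPENAI_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_ENVIRONMENT",
  "PINECONE_INDEX_NAME",
] as const;
```

Calling `missingEnvVars(process.env, REQUIRED_VARS)` before ingesting reports which values are absent. It cannot tell whether `PINECONE_INDEX_NAME` matches an index that actually exists, so the name still has to be compared against the Pinecone console by hand.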

EgyptianBrince commented 1 year ago

facing same issue

wail-asad commented 1 year ago

@umerarif01 @khalidfarooq @EgyptianBrince When you create the index in Pinecone, make sure Dimensions is set to 1536.
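
For context, 1536 is the vector length produced by OpenAI's text-embedding-ada-002 model, which is assumed here to be the embedding model in use. The index dimension has to match the length of the vectors being stored. A small sketch of a guard that makes a mismatch obvious before anything is sent to Pinecone; `checkDimension` is a hypothetical helper, not part of the repository or of any library.

```typescript
// 1536 is the value given in the comment above; the helper is hypothetical.
export const EXPECTED_DIMENSION = 1536;

// Throws a descriptive error when a vector's length differs from the
// dimension the index was created with.
export function checkDimension(
  vector: readonly number[],
  expected: number = EXPECTED_DIMENSION,
): void {
  if (vector.length !== expected) {
    throw new Error(
      `Embedding has ${vector.length} dimensions, but the index expects ${expected}`,
    );
  }
}
```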

khalidfarooq commented 1 year ago

The configuration is correct (index name, dimensions, environment), but it's still not working.

bookofbash commented 1 year ago

I ended up using PDFLoader instead:

import { PDFLoader } from "langchain/document_loaders/fs/pdf";

Then set up the first part of the code like this:

/* Path of the PDF file to load */
const filePath = 'docs/LLM3.pdf'; // use your filename

export const run = async () => {
  try {
    /* load raw docs from a single PDF file */
    const loader = new PDFLoader(filePath, {
      // you may need to add `.then(m => m.default)` to the end of the import
      pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js").then(m => m.default),
    });
    // or simply: const loader = new PDFLoader(filePath);
    const rawDocs = await loader.load();
    // ... the rest of the script stays as it was

umerarif01 commented 1 year ago

> @umerarif01 @khalidfarooq @EgyptianBrince When you create the index in Pinecone, make sure Dimensions is set to 1536.

Despite setting the dimensions to 1536, the script still gave an error. However, I was able to resolve the issue by using the following Python script to ingest the document, and the chatbot now works smoothly.

You can access the Python script at the following link: https://github.com/ucl98/pinecone_ingest_python_implementation

Make sure to follow all of the instructions properly if you are going to use it.

shiruken1 commented 1 year ago

Same here. Trying to figure out what Pinecone's error says but I can't make heads or tails of the error's structure.

data: { error: [Object] }

When I try to log the actual error object, I get undefined :man_shrugging:
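
The `[Object]` placeholder appears because Node's `console.log` only expands nested objects to a limited depth. A minimal sketch of one way to print the whole structure, using only `JSON.stringify`; the helper name `describeError` is hypothetical, and the shape of Pinecone's error payload is not known from this thread.

```typescript
// Serialize a value with every nested level expanded, so nothing is
// abbreviated as [Object].
export function describeError(value: unknown): string {
  const seen = new WeakSet<object>();
  return JSON.stringify(
    value,
    (_key, v) => {
      if (typeof v === "object" && v !== null) {
        // Objects that were already visited (including cycles) are replaced
        // with a marker so serialization cannot recurse forever.
        if (seen.has(v)) return "[Circular]";
        seen.add(v);
        if (v instanceof Error) {
          // name, message and stack are not enumerable, so copy them
          // explicitly alongside any extra fields attached to the error.
          return { ...v, name: v.name, message: v.message, stack: v.stack };
        }
      }
      return v;
    },
    2,
  );
}
```

In a `catch (error)` block, `console.log(describeError(error))` prints the full payload. Node's built-in `util.inspect(error, { depth: null })` is an alternative that keeps the usual console formatting.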

EgyptianBrince commented 1 year ago

It's just a formatting error caused by a recent langchain update. Replace line 13 with this:

export const run = async () => {
  try {
    /* load raw docs from all the files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new CustomPDFLoader(path, '/pdf'),
    });

It'll work fine afterwards (remember to save the file).

Essentially, all you're doing is adding the ", '/pdf'" in the new DirectoryLoader.

dosubot[bot] commented 1 year ago

Hi, @umerarif01! I'm Dosu, and I'm helping the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were encountering an error while trying to ingest data, and other users like "khalidfarooq", "nexty5870", and "EgyptianBrince" have faced the same issue. User "nexty5870" suggested checking the .env setup, while user "wail-asad" recommended ensuring that the Max Dimensions is set to 1536 in Pinecone. User "bookofbash" also shared a workaround using a different code snippet. Eventually, you were able to resolve the issue by using a Python script for ingestion.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the project!