mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

Failed to ingest data - invalid PDF exception #162

Closed umanglani closed 9 months ago

umanglani commented 1 year ago

I am getting an error while trying to ingest the PDFs that are provided in the repo. I have set up the OpenAI and Pinecone API keys, the index name, etc.

error [InvalidPDFException: Invalid PDF structure]
...\gpt4-pdf-chatbot-langchain\scripts\ingest-data.ts:45
throw new Error('Failed to ingest your data');

DialloYoussouf commented 1 year ago

Try this with your error_message https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/157#issuecomment-1506578204

umanglani commented 1 year ago

Try this with your error_message #157 (comment)

Sorry, I didn't follow. The error is different, and I can't ask ChatGPT because I'm on an office computer. I'm using Node 18, by the way.

mayooear commented 1 year ago

Try a different PDF to see if it's an issue with a corrupted pdf.

bschleter commented 1 year ago

Unfortunately, you'll have to try a different PDF. I got similar errors, along with _ds.store errors, and it was rather difficult to figure out which file was causing them.

I've set up this repo a couple of times now, and I've found that PDFs from certain public sources, old PDFs, and PDFs with a lot of images can fail because the PDF parser is unable to parse them.
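To narrow down which file is the culprit, something like the following rough sketch might help. It is not part of the repo: `looksLikePdf` and `findSuspectFiles` are hypothetical helper names, and the check only looks for a `%PDF-` header near the start of the file and an `%%EOF` marker near the end. A file that passes can still have a broken internal structure, so treat this as a first filter for stray or truncated files rather than a real validator.

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical helper (not from the repo): cheap sanity check that only looks
// for the "%PDF-" header and an "%%EOF" trailer marker. It does not parse the file.
function looksLikePdf(bytes: Buffer): boolean {
  const head = bytes.subarray(0, 1024).toString('latin1');
  const tail = bytes
    .subarray(Math.max(0, bytes.length - 1024))
    .toString('latin1');
  return head.includes('%PDF-') && tail.includes('%%EOF');
}

// Hypothetical helper: returns every regular file in `dir` that fails the
// check above, which also catches stray non-PDF files sitting in the folder.
function findSuspectFiles(dir: string): string[] {
  return fs
    .readdirSync(dir, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => path.join(dir, entry.name))
    .filter((file) => !looksLikePdf(fs.readFileSync(file)));
}
```

Running `findSuspectFiles` on whatever folder the ingest script reads from (the folder name depends on your setup) and moving the listed files out should at least rule out the obvious offenders before re-running the ingest.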

The issue is likely your PDF. See if there is an external tool that can convert your PDF to text, then try again with the text file. You'll need to add a text loader to the code for that. If you don't want to do that, a workaround is to convert the PDF to text, then save that .txt file back to PDF without the original elements. It'll be crude, but it'll at least get things working until you can figure out the real problem.
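For the text route, here is a minimal sketch of what loading the converted .txt files could look like using only Node's standard library. `TextDoc` and `loadTextDocs` are hypothetical names for illustration; the `pageContent` plus `metadata` shape is only loosely modelled on LangChain's document objects, and in the actual ingest script you would presumably plug in LangChain's own text loader instead.

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical shape, loosely modelled on LangChain's documents
// (a text body plus metadata that records where it came from).
interface TextDoc {
  pageContent: string;
  metadata: { source: string };
}

// Hypothetical helper: reads every .txt file in `dir` (non-recursive)
// into a TextDoc and ignores files with any other extension.
function loadTextDocs(dir: string): TextDoc[] {
  return fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.txt'))
    .sort()
    .map((name) => {
      const source = path.join(dir, name);
      return {
        pageContent: fs.readFileSync(source, 'utf8'),
        metadata: { source },
      };
    });
}
```

The resulting array can then go through the same splitting and embedding steps the PDFs would have gone through, since those steps only need the text and its source.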

dosubot[bot] commented 9 months ago

Hi, @umanglani! I'm Dosu, and I'm helping the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were experiencing an error while trying to ingest the PDFs provided in the repository. DialloYoussouf suggested a solution in a comment, but you mentioned that you couldn't follow it because you can't use ChatGPT on your office computer. Additionally, mayooear recommended trying a different PDF to check whether the file is corrupted, and bschleter mentioned that PDFs from certain public sources, old PDFs, or PDFs with lots of images can cause failures. They suggested converting the PDF to text as a workaround.

Based on this information, it seems that the issue has been resolved with the suggested solution of converting the PDF to text. However, we wanted to confirm with you if the issue is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the gpt4-pdf-chatbot-langchain repository!