gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

"def of non-name" error seen when ingesting some PDF files. #10

Closed: sangee2004 closed this issue 4 months ago

sangee2004 commented 4 months ago

Steps to reproduce the problem:

  1. Try to ingest this PDF file: 1000-Ways-to-Make-1000-Dollars.pdf

Ingestion fails with a "def of non-name" error:

 knowledge ingest -d mytestpdflarge /Users/sangeethahariharan/Downloads/1000-Ways-to-Make-1000-Dollars.pdf
2024/05/03 14:50:02 INFO IngestOpts opts="{Filename:0x1400cae2a40 FileMetadata:0x1400b154a00 IsDuplicateFuncName: IsDuplicateFunc:0x105e68920}"
2024/05/03 14:50:02 ERROR Failed to load PDF filename=1000-Ways-to-Make-1000-Dollars.pdf error="def of non-name"
2024/05/03 14:50:02 ERROR Failed to load documents error="failed to load PDF \"1000-Ways-to-Make-1000-Dollars.pdf\": def of non-name"
2024/05/03 14:50:02 failed to load documents: failed to load PDF "1000-Ways-to-Make-1000-Dollars.pdf": def of non-name

iwilltry42 commented 4 months ago

I get this with every single PDF text extractor I have tested so far, including the commercial offering of MuPDF. No idea if we'll be able to solve this anytime soon. Did you find any other PDFs apart from this one that produced this error? It appears to be a structural error in the file itself.

sangee2004 commented 4 months ago

I have not encountered this error with the other PDF files I have tested so far. However, I was previously able to ingest this PDF file successfully when testing with the Python version of the tool from https://github.com/gptscript-ai/knowledge-retrieval-api

iwilltry42 commented 4 months ago

Fixed in HEAD (top of main branch)

sangee2004 commented 4 months ago

Tested with latest knowledge.

Able to ingest and retrieve information successfully from the PDF file attached in this issue.