I got an error when parsing a PDF

idiotTest commented 4 months ago

Describe the bug

When I use the "PDFPlumberLoader" provided by LangChain, an error occurs. The traceback ends with:

File "D:\annaconda\envs\fastapitest\lib\site-packages\pdfplumber\utils\pdfinternals.py", line 16, in
    return "".join(PDFDocEncoding[o] for o in ords)
IndexError: string index out of range
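As an illustration of how this kind of IndexError can arise, here is a minimal sketch. It is not pdfplumber's code and not the actual fix: `TABLE`, `decode_strict`, and `decode_lenient` are hypothetical names, and the 256-entry table size is an assumption made for the example. The point is only that indexing a fixed-length string with a code value beyond its length raises "string index out of range", and that a bounds check avoids the crash.

```python
# Hypothetical stand-in for a fixed-size decoding table (assumed 256 entries).
TABLE = "".join(chr(i) for i in range(256))


def decode_strict(ords):
    """Look up every code in TABLE; raises IndexError if a code is >= len(TABLE)."""
    return "".join(TABLE[o] for o in ords)


def decode_lenient(ords, fallback="\ufffd"):
    """Same lookup, but substitute a fallback character for out-of-range codes."""
    return "".join(
        TABLE[o] if 0 <= o < len(TABLE) else fallback
        for o in ords
    )


if __name__ == "__main__":
    codes = [72, 105, 300]  # 300 is outside the assumed table
    try:
        decode_strict(codes)
    except IndexError as exc:
        print("strict lookup failed:", exc)
    print("lenient lookup:", decode_lenient(codes))
```

Whether a lenient fallback is appropriate depends on the data; it trades a hard failure for possibly lossy text.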

Have you tried repairing the PDF?

Yes. I ran the following code:

import pdfplumber

# Write a repaired copy of the PDF using Ghostscript
pdfplumber.repair(r"D:\google\Linux运维趋势_第10期_日志分析技巧分享.pdf",
                  outfile="./repaired1.pdf",
                  gs_path=r"C:\Program Files\gs\gs10.03.1\bin\gswin64.exe")

but the error still occurs with the repaired file.

Code to reproduce the problem

This is the code that raises the error:

from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader(r"./repaired1.pdf")

content = loader.load()

print(content)

PDF file

This is the file that triggers the error:

Linux运维趋势_第10期_日志分析技巧分享.pdf

Environment

Python version: 3.10
OS: [Linux, Windows]

jsvine commented 3 months ago

Hi @idiotTest, and thanks for flagging this. I've added a fix for issues like these in v0.11.1. Specific change here: https://github.com/jsvine/pdfplumber/commit/4daf0aa1b9132a74a4d1b8960ca95b6607ac49c7
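To confirm that an environment has picked up a release containing this fix, a version check along these lines may help. This is a minimal sketch using only the standard library; `parse_version`, `has_fix`, and `installed_has_fix` are hypothetical helper names introduced for illustration, and the simple numeric comparison assumes plain dotted version strings such as "0.11.1".

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.11.1' into (0, 11, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed, minimum="0.11.1"):
    """True if the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_has_fix(package="pdfplumber"):
    """Check the locally installed package; False if it is not installed."""
    try:
        return has_fix(version(package))
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("pdfplumber includes the fix:", installed_has_fix())
```

If the check reports False, upgrading the package (for example with pip) and re-running the loader would be the next thing to try.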

idiotTest commented 3 months ago

OK, thanks for your reply. I will try it later.

jsvine / pdfplumber