jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

When I set repair=True, there is an error: 'utf-8' codec can't decode byte 0xae in position 239: invalid start byte. Is it caused by the original PDF? #1145

Open zyc1128 opened 1 month ago

zyc1128 commented 1 month ago

Describe the bug

A clear and concise description of what the bug is.

Also, when I use pages.page.char[x]["text"] to get the contents one character at a time, some text from tables is lost. I also found that the image objects have no key holding bytes-like data. How can I save the images in the PDF to local files?
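One possible approach to the image question, offered as a rough sketch rather than the library's recommended method: crop each page to the bounding box of every entry in page.images and render that region to a PNG. This re-renders the region at a chosen resolution and does not recover the original embedded bytes. The helper names clamp_bbox and save_page_images, the output file naming, and the default resolution are assumptions made for illustration; rendering with to_image also depends on pdfplumber's image-rendering dependencies being installed.

```python
from pathlib import Path


def clamp_bbox(bbox, page_bbox):
    """Clip an (x0, top, x1, bottom) box so it lies inside the page box."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


def save_page_images(pdf_path, out_dir, resolution=150):
    """Render every image region on every page to a PNG; return the saved paths.

    Hypothetical helper: the name, file naming scheme, and default
    resolution are illustrative choices, not part of pdfplumber.
    """
    import pdfplumber  # imported lazily so clamp_bbox works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for i, img in enumerate(page.images):
                # Image boxes can extend past the page edge, and crop()
                # rejects such boxes, so clip them to the page first.
                bbox = clamp_bbox(
                    (img["x0"], img["top"], img["x1"], img["bottom"]),
                    page.bbox,
                )
                # Skip boxes that are empty after clipping.
                if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
                    continue
                target = out / f"page{page.page_number}_img{i}.png"
                page.crop(bbox).to_image(resolution=resolution).save(str(target))
                saved.append(target)
    return saved
```

Calling save_page_images("input.pdf", "images") would write one PNG per detected image region into the images directory; both arguments here are placeholder paths.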

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

Code to reproduce the problem

Paste it here, or attach a Python file.

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

What did you expect the result should have been?

Actual behavior

What actually happened, instead?

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

Additional context

Add any other context/notes about the problem here.

jsvine commented 3 weeks ago

Version v0.11.1, just released, attempts to fix repair=True. Can you upgrade your version of pdfplumber (pip install -U pdfplumber) and try again?
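After upgrading, a retry could be wrapped as in the sketch below, which assumes the failure surfaces as a UnicodeDecodeError raised by the open call itself (the thread does not confirm where the error is raised). The function name open_with_repair_fallback and its opener parameter are hypothetical, added only so the fallback logic can be exercised without a real PDF.

```python
def open_with_repair_fallback(path, opener=None):
    """Try opening with repair=True; on a decode error, open without repair.

    Returns (pdf, repaired), where repaired says whether repair=True worked.
    `opener` is a hypothetical injection point for illustration; by default
    it is pdfplumber.open.
    """
    if opener is None:
        import pdfplumber  # imported lazily so the logic is testable alone

        opener = pdfplumber.open
    try:
        return opener(path, repair=True), True
    except UnicodeDecodeError:
        # Assumption: the "'utf-8' codec can't decode byte" message reported
        # above corresponds to this exception type.
        return opener(path), False
```

The caller remains responsible for closing the returned PDF object, for example by using it in a with statement.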