jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Add `repair` method? #824

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

In issue #799, @sandzone had the suggestion to add PDF-repairing as a pdfplumber feature. Although it would be impractical for pdfplumber to do the repairing itself, it seems feasible to have the library shell out to Ghostscript and/or other command-line tools (e.g., poppler, mutool, etc.) that do PDF repair. I could see having two interfaces:

Whatever the interface, this would probably need some clear exception handling for when the user had not installed Ghostscript/etc.
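A minimal sketch of that kind of check, assuming the tool is looked up on the PATH before any repair is attempted. The names `RepairToolNotFound` and `find_repair_tool` are hypothetical and not part of pdfplumber; `gs` is the usual Ghostscript executable name on Linux and macOS.

```python
import shutil


class RepairToolNotFound(RuntimeError):
    """Hypothetical exception: the external repair tool is not installed."""


def find_repair_tool(executable: str = "gs") -> str:
    """Return the full path to the repair executable, or raise a clear error."""
    path = shutil.which(executable)
    if path is None:
        # Fail early with an actionable message instead of a bare
        # FileNotFoundError from the subprocess call later on.
        raise RepairToolNotFound(
            f"Cannot find '{executable}' on PATH. "
            "Install it (or pass the executable's location) to use PDF repair."
        )
    return path
```

Checking up front keeps the "tool missing" case separate from the "tool ran but could not repair the file" case.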

sandzone commented 1 year ago

Thanks for opening a separate thread @jsvine.

Adding a repair preprocessing step also improves the general quality of PDF parsing. For me, the repair step also solved common 'text' issues: the parsed text was different (jumbled) from what was displayed in Okular, and repairing the file first resolved that too.

On Linux machines (this also works on AWS Lambda), my current solution is to invoke the preprocessing step via subprocess.call().
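The comment does not show the actual command, so the following is only a sketch of what such a preprocessing step could look like, assuming Ghostscript is installed as `gs` and that rewriting the file through its `pdfwrite` device is the repair. The helper names are hypothetical. It uses `subprocess.run(..., check=True)` rather than `subprocess.call()` so that a non-zero exit status raises an exception.

```python
import subprocess
from pathlib import Path


def build_gs_command(src, dst, executable="gs"):
    """Build a Ghostscript command that rewrites src into a fresh PDF at dst."""
    return [
        executable,
        "-q",                 # suppress the startup banner
        "-o", str(dst),       # output file (also implies batch, no-pause mode)
        "-sDEVICE=pdfwrite",  # re-emit the document as a new PDF
        str(src),
    ]


def repair_with_ghostscript(src, dst, executable="gs"):
    """Run Ghostscript on src and return the path of the rewritten file."""
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst, executable), check=True)
    return Path(dst)
```

The rewritten file can then be opened with `pdfplumber.open(...)` as usual.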

samkit-jain commented 1 year ago

My preference would be for the second option. When the repair fails, the PDF should still be loaded as usual, and the failure to repair should be reported as a warning.

Passing a boolean or a string to the repair keyword might be a bit confusing.
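A sketch of the fall-back behaviour described above, assuming a plain boolean `repair` flag and a separate, pluggable repair step. The function `read_pdf_bytes` and its `repair_fn` parameter are hypothetical and only illustrate the control flow; they are not pdfplumber's API.

```python
import warnings
from pathlib import Path
from typing import Callable, Optional


def read_pdf_bytes(
    path,
    repair: bool = False,
    repair_fn: Optional[Callable[[str], bytes]] = None,
) -> bytes:
    """Return the PDF's bytes, repaired if requested and possible."""
    original = Path(path).read_bytes()
    if not repair:
        return original
    if repair_fn is None:
        raise ValueError("repair=True requires a repair_fn")
    try:
        return repair_fn(str(path))
    except Exception as exc:
        # A failed repair is reported, but loading still proceeds
        # with the unmodified file.
        warnings.warn(f"Could not repair {path}: {exc}; using the original file.")
        return original
```

Keeping `repair` a pure boolean and passing the method separately avoids a single keyword that accepts either a boolean or a string.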

jsvine commented 1 year ago

@samkit-jain, I like your proposal for breaking out repair: bool and repair_method: str into separate parameters.

Re. this:

> My preference would be for the second option

I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?

samkit-jain commented 1 year ago

> I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?

I am sorry, I don't quite follow. Could you please elaborate, maybe with an example?

jsvine commented 1 year ago

Sure! Interface 1:

import pdfplumber
repaired_pdf_bytes = pdfplumber.repair("corrupted.pdf")
with open("fixed.pdf", "wb") as f:
    f.write(repaired_pdf_bytes)

... or similarly:

import pdfplumber
pdfplumber.repair("corrupted.pdf", outfile="repaired.pdf")

Interface 2:

import pdfplumber
pdf = pdfplumber.open("corrupted.pdf", repair=True)
page_text = pdf.pages[0].extract_text()

samkit-jain commented 1 year ago

Thanks, @jsvine. I understand now. Yes, having both makes more sense and is more convenient for the user. Also, I think we could add a new property, repaired_pdf_path, that gives the path to the repaired PDF. I expect the majority of use cases to be covered by interface 2. If a user wants to access the repaired PDF after using interface 2, then instead of re-repairing the PDF with interface 1, they could use the exposed property to get the path to the already repaired PDF.
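One way such a property could look, sketched under the assumption that the repaired bytes are written to a temporary file. The class `RepairedDocument` is hypothetical and stands in for whatever object interface 2 would return.

```python
import tempfile
from pathlib import Path
from typing import Optional


class RepairedDocument:
    """Hypothetical holder that remembers where the repaired PDF was written."""

    def __init__(self, repaired_bytes: Optional[bytes] = None):
        self._repaired_path: Optional[Path] = None
        if repaired_bytes is not None:
            # delete=False keeps the file on disk after the handle is closed,
            # so callers can reuse it instead of repairing again.
            with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                tmp.write(repaired_bytes)
            self._repaired_path = Path(tmp.name)

    @property
    def repaired_pdf_path(self) -> Optional[Path]:
        """Path to the repaired copy, or None if no repair took place."""
        return self._repaired_path
```

With this design the caller is responsible for deleting the temporary file once it is no longer needed.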

PS: I can also take up implementing this repair functionality, unless of course you have already started working on it :)

jsvine commented 1 year ago

Thanks, @samkit-jain! That additional property sounds good to me. And thank you for offering to implement this! I haven't started on it yet.

jsvine commented 1 year ago

Now available in v0.10.0, with explanation added in https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This is a new feature and somewhat experimental, so I haven't yet added it to the main documentation. I have, however, mentioned it in the bug-report issue template.

I didn't end up adding that additional property, but I'm still open to it!