Closed jsvine closed 1 year ago
Thanks for opening a separate thread @jsvine.
Adding a repair preprocess is also improving the general quality of pdf parsing. For me, repair process also solved common 'text' issues. Parsed text was different (jumbled) from what was being displayed in Okular. Repair pre-process resolved that issue too.
For linux machines (also works with AWS lambda), invoking the preprocess step via subprocess.call() is my current solution.
My preference would be for the second option. When the repair fails, the PDF should still be loaded correctly and the failure to repair be notified as a warning.
Passing a boolean or a string to the repair
keyword might be a bit confusing.
repair_method
which would accept a string. It has the disadvantage of having a new parameter added but has the advantage of keeping things a bit simple.repair
param to repair_method
which would be Optional[str]
. If None
, don't repair. If a string, repair using that technique.@samkit-jain, I like your proposal for breaking out repair: bool
and repair_method: str
into separate parameters.
Re. this:
My preference would be for the second option
I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?
I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?
I am sorry I am unable to understand. Could you please elaborate maybe with an example?
Sure! Interface 1:
import pdfplumber
repaired_pdf_bytes = pdfplumber.repair("corrupted.pdf")
with open("fixed.pdf", "wb") as f:
f.write(repaired_pdf_bytes)
... or similarly:
import pdfplumber
pdfplumber.repair("corrupted.pdf", outfile="repaired.pdf")
Interface 2:
import pdfplumber
pdf = pdfplumber.open("corrupted.pdf", repair=True)
page_text = pdf.pages[0].extract_text()
Thanks @jsvine Able to understand now. Yes, it makes more sense. Gives more convenience to the user. Also, I think that we can add a new property repaired_pdf_path
that will give the path to the repaired PDF. I think that the majority of the use-cases will be solved by interface 2. If there comes a use-case that the user wants to access the repaired PDF after using interface 2, instead of re-repairing the PDF using interface 1, they can use the exposed property and get the path to the already repaired PDF.
PS: I can also take up implementing this repair functionality unless of course you haven't already started working on it :)
Thanks, @samkit-jain! That additional property sounds good to me. And thank you for offering to implement this! I haven't started on it yet.
Now available in v0.10.0, with explanation added in https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md
This is a new feature and somewhat experimental, so I haven't yet added it to the main documentation. I have, however, mentioned it in the bug-report issue template.
I didn't end up adding that additional property, but I'm still open to it!
In issue #799, @sandzone had the suggestion to add PDF-repairing as a
pdfplumber
feature. Although it would be impractical forpdfplumber
to do the repairing itself, it seems feasible to have the library shell out to Ghostscript and/or other command-line tools (e.g., poppler, mutool, etc.) that do PDF repair. I could see having two interfaces:pdfplumber.repair("path/to/malformed.pdf", "path/to/fixed.pdf", engine="ghostscript")
, which would write the fixed version.pdfplumber.open("path/to/malformed.pdf", repair=Optional[Union[bool, str]]: False)
, where you would either passTrue
or the name of the engine ("ghostscript", etc.). This would write the fixed PDF to a temporary file, and load that instead of the original PDF.Whatever the interface, this would probably need some clear exception handling for when the user had not installed Ghostscript/etc.