jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

Only template is extracted #486

Closed Ceros95 closed 3 years ago

Ceros95 commented 3 years ago

Describe the bug

I have two PDFs, both built on a template and each containing information about a person. Reading one of them works perfectly and all text is included, but when reading the other, all information about the person is lost and only the template text is extracted. Do you have any idea what goes wrong and why?

Code to reproduce the problem

import pdfplumber

# Open the PDF and extract the text of the first page only.
with pdfplumber.open('example.pdf') as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()

PDF file

I'm afraid I can't share the PDFs. They are sensitive in themselves, and redacting them would not be enough. I'm very sorry about this.

Environment

Additional context

A slight difference I have discovered is in the metadata:

PDF1 (which works):

{'Creator': 'TargetStream Technologies', 'Author': 'TargetStream Technologies', 'Producer': 'TargetStream StreamEDS rv1.7.76 for XX'}

PDF2 (which does not work):

{'Author': 'TargetStream Technologies', 'CreationDate': "D:20210709131913+02'00'", 'Creator': 'TargetStream Technologies', 'ModDate': "D:20210709131913+02'00'", 'Producer': 'TargetStream StreamEDS rv1.7.76 for XX'}

I don't know if this is of any help or relevance.
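To make the metadata comparison above explicit, here is a minimal sketch that reports which keys appear in only one dictionary and which shared keys have different values. The helper name diff_metadata is hypothetical and not part of pdfplumber; with pdfplumber, the dictionaries would typically come from pdf.metadata.

```python
def diff_metadata(a, b):
    """Compare two metadata dicts (hypothetical helper, not a pdfplumber API)."""
    only_a = sorted(a.keys() - b.keys())          # keys only in the first dict
    only_b = sorted(b.keys() - a.keys())          # keys only in the second dict
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return {"only_in_first": only_a, "only_in_second": only_b, "changed": changed}


# The two dictionaries quoted in this report.
pdf1_meta = {
    'Creator': 'TargetStream Technologies',
    'Author': 'TargetStream Technologies',
    'Producer': 'TargetStream StreamEDS rv1.7.76 for XX',
}
pdf2_meta = {
    'Author': 'TargetStream Technologies',
    'CreationDate': "D:20210709131913+02'00'",
    'Creator': 'TargetStream Technologies',
    'ModDate': "D:20210709131913+02'00'",
    'Producer': 'TargetStream StreamEDS rv1.7.76 for XX',
}

result = diff_metadata(pdf1_meta, pdf2_meta)
print(result)
```

For these two dictionaries, the only difference is the presence of CreationDate and ModDate in PDF2; the shared fields are identical.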

samkit-jain commented 3 years ago

Hi @Ceros95, I appreciate your interest in the library. When dealing with text-extraction issues, the first step is to check whether pdfminer.six can extract the text, since pdfplumber relies on pdfminer.six behind the scenes. pdfminer.six also provides a handy command-line tool for text extraction, pdf2txt.py. When you run it on the PDF, like python pdf2txt.py file.pdf, does the text get extracted? If not, you should create an issue there.
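The same check can also be done from Python. The following is a rough sketch, assuming pdfminer.six is installed; it uses pdfminer.six's high-level extract_text function, while the helper names missing_phrases and check_with_pdfminer are hypothetical and only serve the illustration. The idea is to pass a few phrases you expect on the page and see which ones pdfminer.six fails to return.

```python
def missing_phrases(text, phrases):
    """Return the phrases that do not occur in the extracted text.

    Whitespace is normalised so that line breaks in the extracted
    text do not cause false negatives.
    """
    haystack = " ".join((text or "").split())
    return [p for p in phrases if " ".join(p.split()) not in haystack]


def check_with_pdfminer(path, phrases):
    """Extract the first page with pdfminer.six and list missing phrases."""
    # Imported lazily so missing_phrases works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(path, page_numbers=[0])  # first page only
    return missing_phrases(text, phrases)
```

If check_with_pdfminer('example.pdf', [...]) reports the personal details as missing, the problem most likely sits in pdfminer.six or in the PDF itself rather than in pdfplumber.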

You can also try repairing the PDF using Ghostscript, like so:

gs -o output.pdf -sDEVICE=pdfwrite input.pdf

and then try the text extraction.
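As a sketch of how the repair and the retry could be scripted together, the snippet below runs the same Ghostscript command and then extracts text from the rewritten file. It assumes the Ghostscript executable is available on the PATH as gs (on Windows the name usually differs) and that pdfplumber is installed; the function names are hypothetical.

```python
import subprocess


def ghostscript_repair_command(src, dst):
    """Build the Ghostscript command from this thread as an argument list."""
    return ["gs", "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def repair_and_extract(src, dst):
    """Rewrite src with Ghostscript, then extract text from every page of dst."""
    # Imported lazily so the command builder works without pdfplumber installed.
    import pdfplumber

    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(ghostscript_repair_command(src, dst), check=True)

    with pdfplumber.open(dst) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

Calling repair_and_extract('input.pdf', 'output.pdf') returns one entry per page, which makes it easy to compare against the text extracted from the original file.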

sreeni5493 commented 3 years ago

@samkit-jain Is there an alternative way to repair PDFs apart from Ghostscript (GS)? GS has some licensing issues.

jsvine commented 3 years ago

@sreeni5493 Another tool is https://community.coherentpdf.com/, though not knowing the specifics of your situation, I don't know whether it satisfies your license-type requirements.

jsvine commented 3 years ago

Closing due to the lack of specific information necessary for any action/troubleshooting. Feel free to reopen @Ceros95 with more details.