Hi @Ceros95, I appreciate your interest in the library. When dealing with text-extraction issues, the first step is to check whether pdfminer.six is able to extract the text. Behind the scenes, pdfplumber relies on pdfminer.six. They also provide a handy tool for text extraction that can be found here. When you run it on the PDF, for example python pdf2txt.py file.pdf, is the text getting extracted? If not, you should create an issue there.
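The same check can be scripted rather than run through pdf2txt.py. The following is a minimal sketch, assuming pdfminer.six and pdfplumber are both installed; it extracts one page with each library and lists the lines that only pdfminer.six recovered. The helper names `lines_only_in` and `compare_extractors` are hypothetical and not part of either library.

```python
def lines_only_in(reference_text, candidate_text):
    """Return non-empty lines of reference_text that are absent from candidate_text."""
    candidate = {line.strip() for line in (candidate_text or "").splitlines()}
    return [
        line.strip()
        for line in (reference_text or "").splitlines()
        if line.strip() and line.strip() not in candidate
    ]


def compare_extractors(path, page_index=0):
    """Extract one page with pdfminer.six and pdfplumber and compare the results."""
    # Imported lazily so lines_only_in() can be used without either library.
    import pdfplumber
    from pdfminer.high_level import extract_text

    # page_numbers is a zero-indexed list of pages to extract.
    miner_text = extract_text(path, page_numbers=[page_index])
    with pdfplumber.open(path) as pdf:
        plumber_text = pdf.pages[page_index].extract_text()

    return {
        "pdfminer_chars": len(miner_text or ""),
        "pdfplumber_chars": len(plumber_text or ""),
        "only_in_pdfminer": lines_only_in(miner_text, plumber_text),
    }
```

Calling `compare_extractors('example.pdf')` on the failing file shows where the text is lost. If the person-specific text is missing from the pdfminer.six output as well, the problem sits below pdfplumber and is better reported to pdfminer.six.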
You can also try repairing the PDF using Ghostscript, like so:
gs -o output.pdf -sDEVICE=pdfwrite input.pdf
and then try the text extraction again.
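If the repair has to run from Python, for example before handing the file to pdfplumber, the Ghostscript command above can be wrapped with the standard library. This is a sketch under assumptions: Ghostscript is installed and its executable is named `gs` on PATH (the name may differ by platform), and the function names `ghostscript_command` and `repair_pdf` are hypothetical.

```python
import shutil
import subprocess


def ghostscript_command(src, dst, executable="gs"):
    """Build the same Ghostscript invocation as the command above."""
    return [executable, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst, executable="gs"):
    """Rewrite src through Ghostscript's pdfwrite device and return dst."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(ghostscript_command(src, dst, executable), check=True)
    return dst
```

The returned path can then be opened as usual, for instance `pdfplumber.open(repair_pdf('input.pdf', 'output.pdf'))`, to see whether the rewritten file yields the missing text.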
@samkit-jain Is there an alternative way to repair PDFs apart from GS? GS has some licensing issues.
@sreeni5493 Another tool is https://community.coherentpdf.com/, though not knowing the specifics of your situation, I don't know whether it satisfies your license-type requirements.
Closing due to the lack of specific information necessary for any action/troubleshooting. Feel free to reopen @Ceros95 with more details.
Describe the bug
I have two PDFs, both based on a template and each containing information about a person. Reading one of them works perfectly and all text is included, whereas reading the other loses all information about the person and returns only the template text. Do you have any idea what goes wrong and why?
Code to reproduce the problem
import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
PDF file
I'm afraid I can't share the PDFs. They are sensitive in themselves, and redacting them would not be enough. I'm very sorry about this.
Environment
Additional context
A slight difference I have discovered is in the metadata:
PDF1 (which works):
{'Creator': 'TargetStream Technologies', 'Author': 'TargetStream Technologies', 'Producer': 'TargetStream StreamEDS rv1.7.76 for XX'}
PDF2 (which does not work):
{'Author': 'TargetStream Technologies', 'CreationDate': "D:20210709131913+02'00'", 'Creator': 'TargetStream Technologies', 'ModDate': "D:20210709131913+02'00'", 'Producer': 'TargetStream StreamEDS rv1.7.76 for XX'}
I don't know if this is of any help or relevance.
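To make the metadata comparison explicit, the two dictionaries can be diffed key by key. This is a small sketch using only the values quoted above; `diff_metadata` is a hypothetical helper, not a pdfplumber function.

```python
def diff_metadata(working, failing):
    """Return differing keys as key -> (value in working, value in failing)."""
    differences = {}
    for key in sorted(set(working) | set(failing)):
        left, right = working.get(key), failing.get(key)
        if left != right:
            # A value of None means the key is absent from that file.
            differences[key] = (left, right)
    return differences


pdf1 = {
    'Creator': 'TargetStream Technologies',
    'Author': 'TargetStream Technologies',
    'Producer': 'TargetStream StreamEDS rv1.7.76 for XX',
}
pdf2 = {
    'Author': 'TargetStream Technologies',
    'CreationDate': "D:20210709131913+02'00'",
    'Creator': 'TargetStream Technologies',
    'ModDate': "D:20210709131913+02'00'",
    'Producer': 'TargetStream StreamEDS rv1.7.76 for XX',
}

print(diff_metadata(pdf1, pdf2))
```

Only CreationDate and ModDate differ; Creator, Author and Producer are identical. Those two keys are document-information entries rather than page content, so on their own they probably do not explain the missing text.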