Hi @shravspy, and thanks for your interest in this library! Two questions:
The title of the issue doesn't seem to match the PDF you've uploaded. Can you confirm you uploaded the correct PDF?
Could you edit your issue to more completely provide the issue template's requested details? Doing so will help this library's maintainers help you (and other users) more efficiently.
Hi @jsvine, thank you for the quick reply.
Actually, I was following the san-jose notebook to work with my PDF file. Anyhow, I have updated all the information now :)
Hi @shravspy, page 164 in the PDF you uploaded appears to be a scan with OCR run on it. The hyphens are not being extracted because the OCR didn't detect them, as is evident from this screenshot.
Regarding the page layout, have a look at issue #10
I am closing this issue since it seems to be an issue with the PDF and not the library.
Hi @samkit-jain, if the page is a scan with OCR run on it and the OCR skipped the hyphens, why do I still see the hyphens in the PDF? Second, do you know of any workaround to get the text out of the PDF exactly as it appears? Thank you.
Hi @shravspy, when a PDF is a scanned document, every page is basically just an image with no copyable content. When OCR software is run on the PDF, it creates a transparent text layer over the image layer, which is what lets you copy text from the PDF. No OCR software is perfect, and its accuracy is affected by factors such as scan quality, skew, and shadows, so certain characters may be recognised incorrectly or not at all. What you see is the union of the two layers, not their intersection, which is why the original image still appears as-is even where characters were not recognised. In most cases the text layer does not hide the regions it fails to cover, which would explain why you can see the hyphens but cannot copy-paste them.
It's a different story if you're able to copy-paste the hyphens but pdfplumber isn't. Is that the case?
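One way to answer that question is to look at the characters pdfplumber finds in the text layer and count the hyphen-like ones. The following is a minimal sketch, not part of pdfplumber itself: `count_hyphen_like` and `inspect_page` are hypothetical helper names, the set of hyphen-like characters is an assumption about what a text layer might contain, and the page number is assumed to be 1-based.

```python
from collections import Counter

# Assumption: characters that may stand in for a hyphen or dash in a text layer.
HYPHEN_LIKE = {"-", "\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212", "\u00ad"}


def count_hyphen_like(chars):
    """Count hyphen-like characters in a list of pdfplumber-style char dicts."""
    return Counter(c["text"] for c in chars if c.get("text") in HYPHEN_LIKE)


def inspect_page(pdf_path, page_number):
    """Return hyphen-like character counts for a 1-based page number."""
    import pdfplumber  # imported lazily so count_hyphen_like works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a zero-based list
        return count_hyphen_like(page.chars)
```

If `inspect_page("TXWylie01a-FIN (2).pdf", 163)` returns an empty `Counter`, the text layer most likely contains no hyphens on that page, which would point to the OCR rather than to pdfplumber.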
Irrespective of whether the PDF is an OCR output or not, you would need OCR software to recognise those hyphens. A workaround could be to run the page through another OCR tool, such as the open-source Tesseract or the paid ABBYY. Both can output a PDF with a text layer over the image.
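The Tesseract route could look roughly like the sketch below. It assumes the `pytesseract` package and the Tesseract binary are installed alongside pdfplumber; `to_zero_based` and `ocr_page` are hypothetical helper names, the page number is assumed to be 1-based, and the default resolution of 300 is an arbitrary starting point rather than a recommended value.

```python
def to_zero_based(page_number, page_count):
    """Convert a 1-based page number to a zero-based index, with a bounds check."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def ocr_page(pdf_path, page_number, resolution=300):
    """Render one page to an image and run Tesseract on it, returning plain text."""
    # Imported lazily so to_zero_based can be used without these packages.
    import pdfplumber
    import pytesseract

    with pdfplumber.open(pdf_path) as pdf:
        index = to_zero_based(page_number, len(pdf.pages))
        # .original is the rendered page as a PIL image.
        image = pdf.pages[index].to_image(resolution=resolution).original
        return pytesseract.image_to_string(image)
```

To get a PDF with a fresh text layer instead of plain text, `pytesseract.image_to_pdf_or_hocr(image, extension="pdf")` returns the PDF as bytes. Whether the new OCR pass picks up the hyphens still depends on the scan quality.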
Describe the bug
I tried using extract_text() as in this notebook, but with my PDF it skips the hyphens in the table.
I have attached the file, and I am trying to extract page number 163: TXWylie01a-FIN (2).pdf