jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

TXWylie01a-FIN.pdf #250

Closed shravspy closed 4 years ago

shravspy commented 4 years ago

Describe the bug

I tried using extract_text() as in this notebook, but with my PDF it skips the hyphens in the table.

I have attached the file, TXWylie01a-FIN (2).pdf, and I am trying to extract page number 163.



```python
import pdfplumber
pdf = pdfplumber.open('TXWylie01a-FIN.pdf')
table = pdf.pages[163].extract_text()  # pdf.pages is 0-indexed: index 163 is the 164th page
```

## Code to reproduce the problem

<img width="1440" alt="Screen Shot 2020-08-11 at 11 13 54 PM" src="https://user-images.githubusercontent.com/26505544/89971194-94b65000-dc28-11ea-86e3-5dd862eeab82.png">

## PDF file

*Please attach any PDFs necessary to reproduce the problem.*

*If you need to redact text in a sensitive PDF, you can run it through [JoshData/pdf-redactor](https://github.com/JoshData/pdf-redactor).*

## Expected behavior

I want to get each line exactly as it is laid out, meaning the output should keep all the spaces and hyphens in the table.

## Actual behavior

Spaces are ignored and hyphens are removed. I am using hyphens as a token to distinguish 'NA' values in the columns.

## Screenshots

*If applicable, add screenshots to help explain your problem.*

## Environment

- pdfplumber version: [e.g., 0.5.22]
- Python version: [e.g., 3.8.1]
- OS: [e.g., Mac, Linux, etc.]

## Additional context

<img width="1440" alt="Screen Shot 2020-08-11 at 11 13 54 PM" src="https://user-images.githubusercontent.com/26505544/89971407-2a51df80-dc29-11ea-94c0-9fdf1a11d8d0.png">
jsvine commented 4 years ago

Hi @shravspy, and thanks for your interest in this library! Two questions:

shravspy commented 4 years ago

Hi @jsvine, thank you for the quick reply.

Actually, I was following the san-jose notebook to work with my PDF file. Anyhow, I have updated all the information now :)

samkit-jain commented 4 years ago

Hi @shravspy. Page 164 in the PDF you uploaded appears to be a scan with OCR run on it. The hyphens are not being extracted because the OCR didn't detect them, as is evident from this screenshot.

Regarding the page layout, have a look at issue #10.
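
To illustrate the general idea of keeping the layout, here is a minimal sketch that rebuilds lines from character positions. It is not pdfplumber's own method and not necessarily what issue #10 proposes. The function name `layout_text` and the parameters `char_width` and `line_tol` are hypothetical, and a fixed average character width is an assumption that only roughly holds for real fonts. The input is a list of dicts with `text`, `x0` and `top` keys, the shape of the entries in `page.chars`.

```python
def layout_text(chars, char_width=5.0, line_tol=3.0):
    """Rebuild text lines from char dicts, padding with spaces by x position.

    chars: dicts with "text", "x0" and "top" keys (as in pdfplumber's page.chars).
    char_width: assumed average character width in PDF points (hypothetical default).
    line_tol: chars whose "top" differs by at most this much share a line.
    """
    rows = []
    for c in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Start a new row when the char sits clearly below the current row.
        if rows and abs(c["top"] - rows[-1][0]["top"]) <= line_tol:
            rows[-1].append(c)
        else:
            rows.append([c])

    lines = []
    for row in rows:
        row.sort(key=lambda c: c["x0"])
        line = ""
        for c in row:
            # Map the x position to a text column and pad up to it.
            col = int(round(c["x0"] / char_width))
            line += " " * max(col - len(line), 0) + c["text"]
        lines.append(line)
    return "\n".join(lines)
```

With a real page this would be called as `layout_text(page.chars)`. It can only position characters that exist in the text layer, so it will not bring back hyphens that the OCR missed.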

I am closing this issue since it seems to be an issue with the PDF and not the library.

shravspy commented 4 years ago

Hi @samkit-jain, if it is scanned with OCR run on it and the OCR skips the hyphens, why do I still see the hyphens in the PDF? Second, do you know of any workaround to get the text data exactly as it appears in the PDF? Thank you.

samkit-jain commented 4 years ago

Hi @shravspy. When a PDF is a scanned document, every page is basically just an image with no copyable content. When OCR software is run on the PDF, it creates a transparent text layer over the image layer, which is what allows you to copy text from the PDF. No OCR software is perfect, and its performance is affected by multiple factors such as scan quality, skew, and shadows, which means that certain characters may be wrongly recognised or not recognised at all. What you see is a union of the 2 layers, not an intersection, so the original image is still shown as-is even where characters were not recognised. In most cases the text layer won't hide the regions it doesn't cover, which would explain why you can see the hyphens but cannot copy-paste them.

It's a different story if you're able to copy-paste the hyphens but pdfplumber isn't. Is that the case?
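
One way to check is to look at the text layer directly. The sketch below assumes the attached file and uses pdfplumber's `page.chars` and `page.images` properties. The helper names `hyphen_report` and `inspect_page` are hypothetical, and the set of dash-like characters is an assumption about what the OCR might have emitted.

```python
# Characters treated as "hyphen-like": hyphen-minus, en dash, em dash, minus sign.
# Which of these an OCR engine emits is an assumption; adjust as needed.
DASHES = {"-", "\u2013", "\u2014", "\u2212"}

def hyphen_report(chars, images):
    """Summarise the text layer given pdfplumber-style char and image dicts."""
    return {
        "n_chars": len(chars),
        "n_images": len(images),
        "n_dashes": sum(1 for c in chars if c["text"] in DASHES),
    }

def inspect_page(path, page_index):
    """Open a PDF and report on one page (page_index is 0-based)."""
    import pdfplumber  # imported here so hyphen_report works without it installed

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        return hyphen_report(page.chars, page.images)
```

For example, `inspect_page("TXWylie01a-FIN.pdf", 163)` on the attached file. A page with one large image and `n_dashes` equal to 0 would be consistent with the hyphens existing only in the scanned image and not in the text layer.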


Whether or not the PDF is OCR output, you would need OCR software to recognise those hyphens. A workaround could be to run the page through other OCR software, such as the open-source Tesseract or the paid ABBYY. Both offer the option of outputting a PDF with a text layer over the image.
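
A rough sketch of the Tesseract route, assuming the Tesseract binary and the third-party `pytesseract` wrapper are installed. The helper names `render_page` and `ocr_image` are hypothetical, and the config string is only a starting point that may or may not help with this particular scan.

```python
# "--psm 6" asks Tesseract to treat the image as one uniform block of text;
# "preserve_interword_spaces=1" asks it to keep runs of spaces between words.
# Both are assumptions about what suits this table, not tested settings.
DEFAULT_CONFIG = "--psm 6 -c preserve_interword_spaces=1"

def render_page(path, page_index, resolution=300):
    """Render one PDF page (0-based index) to a PIL image using pdfplumber."""
    import pdfplumber  # imported lazily; only needed for rendering

    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_index].to_image(resolution=resolution).original

def ocr_image(image, config=DEFAULT_CONFIG, ocr=None):
    """Run OCR on an image; `ocr` defaults to pytesseract.image_to_string."""
    if ocr is None:
        import pytesseract  # requires the Tesseract binary to be installed

        ocr = pytesseract.image_to_string
    return ocr(image, config=config)
```

Usage would look like `text = ocr_image(render_page("TXWylie01a-FIN.pdf", 163))`. To get a PDF with a fresh text layer instead of plain text, `pytesseract.image_to_pdf_or_hocr(image, extension="pdf")` returns the bytes of a searchable PDF that can then be opened with pdfplumber. Whether the hyphens come through still depends on the scan quality.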