jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Multiple lines are merged into one line #317

Closed trrk closed 3 years ago

trrk commented 3 years ago

What are you trying to do?

Cropped page text extraction

What code are you using to do it?

from decimal import Decimal
import pdfplumber

def crop_and_extract(page, bbox):
    # Crop the page to bbox (coordinates relative to the page), then extract text
    cropped = page.crop(bbox, relative=True)
    return cropped.extract_text()

with pdfplumber.open("abcde.pdf") as pdf:
    # Contains only a thin slice of the top line
    bbox1 = (Decimal('75.84'), Decimal('109.44'),
             Decimal('137.28'), Decimal('137.28'))
    # Fully contains both lines
    bbox2 = (Decimal('75.84'), Decimal('85.44'),
             Decimal('137.28'), Decimal('137.28'))

    page = pdf.pages[0]

    print("- bbox1")
    print(crop_and_extract(page, bbox1))
    print()

    print("- bbox2")
    print(crop_and_extract(page, bbox2))

PDF file

abcde.pdf

Expected behavior

- bbox1
ABCDE 
FGHIJK 

- bbox2
ABCDE 
FGHIJK 

Actual behavior

- bbox1
AFGBCHDIJKE  

- bbox2
ABCDE 
FGHIJK 

Screenshots

None

Environment

Additional context

When the crop range contains part of another line, the results are merged into one line. This seems to happen only when the range contains only a small part of another line.

I think merging the lines into one is undesirable. Is there a good way to avoid this?

samkit-jain commented 3 years ago

Hi @trrk, thanks for your interest in the library. The extract_text() method adds a newline only when the doctop of one character differs from the doctop of the next character by more than y_tolerance (default: 3). In bbox1's case, the doctop of the top row is 109.440 and the doctop of the bottom row is 111.567. The difference is 2.127, which is less than the default y_tolerance of 3, so no newline is added. You can lower y_tolerance to a value such as 2 to extract the text with the newline, for example: extract_text(y_tolerance=2).
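To make the tolerance arithmetic concrete, here is a minimal pure-Python sketch of the rule described above. It is a simplified illustration, not pdfplumber's actual implementation: group_into_lines is a hypothetical helper, and the x0 values are invented so that the interleaving resembles the reported output. Only the two doctop values come from this thread.

```python
def group_into_lines(chars, y_tolerance=3):
    """Start a new line whenever the doctop gap exceeds y_tolerance (simplified)."""
    lines = []
    prev_doctop = None
    for char in sorted(chars, key=lambda c: c["doctop"]):
        if prev_doctop is None or char["doctop"] - prev_doctop > y_tolerance:
            lines.append([])
        lines[-1].append(char)
        prev_doctop = char["doctop"]
    # Within each line, order characters left to right
    return ["".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
            for line in lines]


# Hypothetical x0 positions; doctop values are the ones quoted above
top_row = [("A", 76), ("B", 88), ("C", 96), ("D", 108), ("E", 130)]
bottom_row = [("F", 78), ("G", 82), ("H", 100), ("I", 112), ("J", 118), ("K", 124)]
chars = (
    [{"text": t, "x0": x, "doctop": 109.440} for t, x in top_row]
    + [{"text": t, "x0": x, "doctop": 111.567} for t, x in bottom_row]
)

print(group_into_lines(chars))                  # gap 2.127 <= 3: one merged line
print(group_into_lines(chars, y_tolerance=2))   # gap 2.127 > 2: two lines
```

With the default tolerance the two rows collapse into a single line ordered by x position. With y_tolerance=2 they stay separate.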

trrk commented 3 years ago

I see. I like the default y_tolerance value, and I use the library through an interface that lets me select an area on the screen, so I would like to keep y_tolerance at 3. However, after reading the documentation again, I realized that I might be able to use the filter function to remove the small, clipped characters at the edges of the crop. Thank you for your answer.
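One possible shape for that filter-based approach is sketched below. It is not code from this thread. It assumes that crop() slices partially included characters to fit the box, so their top and bottom values describe a short sliver. The helper name make_char_height_filter and the min_height threshold of 3 are hypothetical and would need tuning for the fonts in a real document.

```python
def make_char_height_filter(min_height):
    """Return a predicate that rejects chars shorter than min_height."""
    def keep(obj):
        # Leave non-character objects (lines, rects, ...) untouched
        if obj.get("object_type") != "char":
            return True
        return obj["bottom"] - obj["top"] >= min_height
    return keep


def crop_and_extract(page, bbox, min_height=3):
    """Crop, drop sliver-sized chars, then extract text with the default y_tolerance."""
    cropped = page.crop(bbox, relative=True)
    return cropped.filter(make_char_height_filter(min_height)).extract_text()
```

Note that for bbox1 this would return only the fully contained line rather than both lines. pdfplumber also provides page.within_bbox(), which keeps only objects that fall entirely inside the box and may be a simpler alternative, depending on the selection interface.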