jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

chars' top and bottom attributes exceed page's bbox #932

Closed kalelsun closed 1 year ago

kalelsun commented 1 year ago

Describe the bug

When using pdfplumber to read a PDF, I encountered an issue where the top and bottom attributes of chars on certain pages exceed the bbox (bounding box) of the page.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open('./the_page.pdf')
first_page = pdf.pages[0]
print('page bbox:\n', first_page.bbox)
print('chars bbox:')
for char in first_page.chars:
    print(char['text'], 'x0:', char['x0'], 'top:', char['top'], 'x1:', char['x1'], 'bottom:', char['bottom'], sep='\t')

# page bbox:
#  (0, 0, 631.08, 841.68)
# chars bbox:
# 2 x0: 384.74115   top:    1330.5039889999998  x1: 390.74115   bottom: 1342.5039889999998
# 0 x0: 390.74115   top:    1330.5039889999998  x1: 396.74115   bottom: 1342.5039889999998
# 2 x0: 396.74115   top:    1330.5039889999998  x1: 402.74115   bottom: 1342.5039889999998
# 1 x0: 402.74115   top:    1330.5039889999998  x1: 408.74115   bottom: 1342.5039889999998
# 3 x0: 437.268646  top:    1330.5039889999998  x1: 443.268646  bottom: 1342.5039889999998
# 3 x0: 474.7883    top:    1330.5039889999998  x1: 480.7883    bottom: 1342.5039889999998
# 1 x0: 480.7883    top:    1330.5039889999998  x1: 486.7883    bottom: 1342.5039889999998

PDF file

the_page.pdf

Expected behavior

The top and bottom attributes of chars should be within the bbox of the page.

Actual behavior

The top and bottom attributes of chars exceed the bbox of the page.
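
To make the mismatch easy to check, here is a minimal sketch that flags chars whose box is not fully inside the page bbox. It works on plain dicts with the same keys pdfplumber uses for chars (`x0`, `top`, `x1`, `bottom`). The helper name `chars_outside_bbox`, the `tolerance` parameter, and the second sample char are assumptions added for illustration; the first sample char and the page bbox are copied from the output above.

```python
def chars_outside_bbox(chars, bbox, tolerance=0.0):
    """Return the chars whose box is not fully inside bbox.

    bbox is (x0, top, x1, bottom), the same order as the page bbox printed above.
    tolerance is a hypothetical allowance for floating-point noise.
    """
    x0, top, x1, bottom = bbox
    return [
        c for c in chars
        if c["x0"] < x0 - tolerance
        or c["top"] < top - tolerance
        or c["x1"] > x1 + tolerance
        or c["bottom"] > bottom + tolerance
    ]


# Page bbox and first char taken from the reported output.
page_bbox = (0, 0, 631.08, 841.68)
chars = [
    {"text": "2", "x0": 384.74115, "top": 1330.5039889999998,
     "x1": 390.74115, "bottom": 1342.5039889999998},
    # Hypothetical in-bounds char, added only for contrast.
    {"text": "A", "x0": 100.0, "top": 700.0, "x1": 106.0, "bottom": 712.0},
]

offending = chars_outside_bbox(chars, page_bbox)
print([c["text"] for c in offending])  # prints ['2']
```

On a real page the same check would presumably be called as `chars_outside_bbox(first_page.chars, first_page.bbox)`.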

Screenshots

page

Only the areas enclosed by the blue boxes contain characters; the rest are images.

Environment

Additional context

Google Colab

jsvine commented 1 year ago

Hi @kalelsun, and thanks for your interest in this library. Have you tried repairing the PDF? Does that change the results? When I've seen issues like these in the past, they often are caused by malformed documents.
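
For anyone who wants to try the repair step, here is a minimal sketch. It assumes a pdfplumber release that supports the documented `repair=True` option of `pdfplumber.open` (which relies on Ghostscript being installed). The helper name `count_chars_outside_page` is hypothetical, and this thread does not say which repair method was actually used.

```python
def count_chars_outside_page(path, repair=False):
    """Count first-page chars whose top/bottom fall outside the page bbox."""
    # Imported inside the function so the helper can be defined
    # even where pdfplumber is not installed.
    import pdfplumber

    # repair=True asks pdfplumber to repair the file (via Ghostscript)
    # before parsing; this is assumed to be available in the installed version.
    with pdfplumber.open(path, repair=repair) as pdf:
        page = pdf.pages[0]
        _, page_top, _, page_bottom = page.bbox
        return sum(
            1
            for char in page.chars
            if char["top"] < page_top or char["bottom"] > page_bottom
        )
```

Calling `count_chars_outside_page('./the_page.pdf')` and then `count_chars_outside_page('./the_page.pdf', repair=True)` would show whether repairing the file changes the result.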

kalelsun commented 1 year ago

Yes, you are absolutely right! I have achieved the desired result by fixing the PDF.