Chars' top and bottom attributes exceed the page's bbox

kalelsun commented 1 year ago

Describe the bug

When using pdfplumber to read a PDF, I encountered an issue where the top and bottom attributes of chars on certain pages exceed the bbox (bounding box) of the page.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open('./the_page.pdf')
first_page = pdf.pages[0]
print('page bbox:\n', first_page.bbox)
print('chars bbox:')
for char in first_page.chars:
    print(char['text'], 'x0:', char['x0'], 'top:', char['top'], 'x1:', char['x1'], 'bottom:', char['bottom'], sep='\t')

# page bbox:
#  (0, 0, 631.08, 841.68)
# chars bbox:
# 2 x0: 384.74115   top:    1330.5039889999998  x1: 390.74115   bottom: 1342.5039889999998
# 0 x0: 390.74115   top:    1330.5039889999998  x1: 396.74115   bottom: 1342.5039889999998
# 2 x0: 396.74115   top:    1330.5039889999998  x1: 402.74115   bottom: 1342.5039889999998
# 1 x0: 402.74115   top:    1330.5039889999998  x1: 408.74115   bottom: 1342.5039889999998
# 3 x0: 437.268646  top:    1330.5039889999998  x1: 443.268646  bottom: 1342.5039889999998
# 3 x0: 474.7883    top:    1330.5039889999998  x1: 480.7883    bottom: 1342.5039889999998
# 1 x0: 480.7883    top:    1330.5039889999998  x1: 486.7883    bottom: 1342.5039889999998

PDF file

the_page.pdf

Expected behavior

The top and bottom attributes of chars should be within the bbox of the page.

Actual behavior

The top and bottom attributes of chars exceed the bbox of the page: top is about 1330.5 and bottom about 1342.5, while the page is only 841.68 high.
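
To check this programmatically rather than by eye, a minimal sketch along these lines should work. The helper name `chars_outside_bbox` is hypothetical (it is not part of pdfplumber); it only assumes each char is a dict with `x0`, `top`, `x1` and `bottom` keys, as in the output above, and that the bbox is ordered `(x0, top, x1, bottom)`. The second sample char is made up for contrast.

```python
def chars_outside_bbox(chars, bbox):
    """Return the chars whose box is not fully inside bbox (x0, top, x1, bottom)."""
    page_x0, page_top, page_x1, page_bottom = bbox
    return [
        c for c in chars
        if c["x0"] < page_x0
        or c["x1"] > page_x1
        or c["top"] < page_top
        or c["bottom"] > page_bottom
    ]


# Page bbox and first char copied from the output above.
page_bbox = (0, 0, 631.08, 841.68)
sample_chars = [
    {"text": "2", "x0": 384.74115, "top": 1330.5039889999998,
     "x1": 390.74115, "bottom": 1342.5039889999998},
    # Hypothetical in-bounds char, added only for contrast.
    {"text": "A", "x0": 100.0, "top": 200.0, "x1": 106.0, "bottom": 212.0},
]

outliers = chars_outside_bbox(sample_chars, page_bbox)
for c in outliers:
    print(c["text"], "top:", c["top"], "bottom:", c["bottom"])
```

With a real file, `first_page.chars` and `first_page.bbox` from the reproduction code can be passed in directly.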

Screenshots

Only the areas enclosed by the blue boxes contain characters; the rest of the page consists of images.

Environment

pdfplumber version: 0.9.0
Python version: 3.10.12
OS: Linux

Additional context

Running in Google Colab.

jsvine commented 1 year ago

Hi @kalelsun, and thanks for your interest in this library. Have you tried repairing the PDF? Does that change the results? When I've seen issues like these in the past, they often are caused by malformed documents.

kalelsun commented 1 year ago

Yes, you are absolutely right! I have achieved the desired result by fixing the PDF.
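
For anyone landing here with the same symptom: the thread does not say which tool was used for the repair, so the following is only a sketch of one common approach, rewriting the file with Ghostscript's `pdfwrite` device. It assumes the `gs` executable is installed and on the PATH; the helper names `build_gs_repair_command` and `repair_pdf` and the output file name are hypothetical.

```python
import shutil
import subprocess


def build_gs_repair_command(src, dst, gs_path="gs"):
    """Build a Ghostscript command that rewrites src into dst via pdfwrite."""
    return [gs_path, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst, gs_path="gs"):
    """Rewrite src to dst with Ghostscript; raises if gs is missing or fails."""
    if shutil.which(gs_path) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_path}")
    subprocess.run(build_gs_repair_command(src, dst, gs_path), check=True)
    return dst


if __name__ == "__main__":
    # Hypothetical output name; the input is the file from this issue.
    repair_pdf("./the_page.pdf", "./the_page_repaired.pdf")
```

The rewritten file can then be opened with `pdfplumber.open(...)` as in the reproduction code, and the char coordinates compared against the page bbox again.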
