jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Incorrect number of rows when extracting tables #920

Closed tujinshu closed 1 year ago

tujinshu commented 1 year ago

Describe the bug

The table finder detects the wrong number of rows in the area circled in green.

[Image: WeCom screenshot, 企业微信截图_b25fcb8e-2910-4580-b734-7eded7cfeb8c]

Actual data (page 45): [image]

Code to reproduce the problem


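import pdfplumber

# Open the attached PDF and select page 45 (pdf.pages is zero-indexed)
with pdfplumber.open("./wps.pdf") as pdf:
    p0 = pdf.pages[44]
    im = p0.to_image()
    # Overlay the lines, intersections, and cells found by the table finder
    im.debug_tablefinder()
    im.save("page45_tablefinder.png")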
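To put a number on what the debug image shows, the rows of each detected table can be counted directly. This is only a rough sketch: `row_counts` and `count_rows_on_page` are helper names made up for this example, and `table_settings` is left at pdfplumber's defaults unless a dict is passed in. It assumes the same `./wps.pdf` and page index as above.

```python
from pathlib import Path
from typing import Optional


def row_counts(tables: list) -> list:
    """Return the number of rows in each extracted table."""
    return [len(table) for table in tables]


def count_rows_on_page(path: str, page_index: int,
                       table_settings: Optional[dict] = None) -> list:
    """Extract all tables on one page and return their row counts."""
    # Imported here so row_counts() can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        # extract_tables() returns one list of rows per detected table
        tables = page.extract_tables(table_settings=table_settings or {})
    return row_counts(tables)


if __name__ == "__main__":
    pdf_path = "./wps.pdf"  # the file attached to this issue
    if Path(pdf_path).exists():
        # Page 45 is index 44
        print(count_rows_on_page(pdf_path, 44))
```

Comparing the printed counts against the row count visible on the page makes it easier to state the expected and actual results precisely.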

PDF file

Original PDF: wps.pdf

Expected behavior

Only one row should be extracted.

Actual behavior

The row is split into 5 rows.

Screenshots


[Image: WeCom screenshot, 企业微信截图_b25fcb8e-2910-4580-b734-7eded7cfeb8c]

Environment

Additional context
