jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

A table spans two pages, and some of its data fails to extract #720

Closed by AresElvis 2 years ago

AresElvis commented 2 years ago

Describe the bug

A table that starts on the first page continues onto the second page because it does not fit on one page. When the tables are extracted, the data in the first row on the second page is missing.

Code to reproduce the problem

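import pdfplumber
# file_path_dir is the directory that holds the PDF (not defined in this snippet)
file_name = '《新能源汽车推广应用推荐车型目录》(2022年第6批)车型主要参数.pdf.pdf'
pdf = pdfplumber.open(file_path_dir + file_name)
for i, page in enumerate(pdf.pages):
    tables = page.extract_tables()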
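The loop above only assigns `tables` and never inspects it, so the following is a minimal sketch of one way to make the missing row visible. It collects the first row of every table on every page, so the first row reported for page 2 can be compared against the PDF. `first_rows_by_page` and `extract_first_rows` are hypothetical helper names, not part of pdfplumber. The sketch assumes only that `page.extract_tables()` returns a list of tables, each a list of rows, each a list of cell strings (or `None`). It diagnoses the problem and does not fix it.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]
Table = List[Row]


def first_rows_by_page(pages_tables: List[List[Table]]) -> List[Tuple[int, int, Row]]:
    """Return (page_number, table_index, first_row) for every non-empty table.

    pages_tables holds one entry per page; each entry is the list that
    page.extract_tables() returned for that page.
    """
    result = []
    for page_number, tables in enumerate(pages_tables, start=1):
        for table_index, table in enumerate(tables):
            if table:  # skip tables that came back with no rows
                result.append((page_number, table_index, table[0]))
    return result


def extract_first_rows(pdf_path: str) -> List[Tuple[int, int, Row]]:
    """Hypothetical wrapper: open the PDF and report each table's first row."""
    import pdfplumber  # imported lazily so first_rows_by_page has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        pages_tables = [page.extract_tables() for page in pdf.pages]
    return first_rows_by_page(pages_tables)
```

Calling `extract_first_rows(file_path_dir + file_name)` and printing the entries for page 2 should show whether the row containing the expected value was extracted at all or dropped.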

PDF file

《新能源汽车推广应用推荐车型目录》(2022年第6批)车型主要参数.pdf.pdf

Expected behavior

Overall width (外廓尺寸宽) (mm): 2550

image

Actual behavior

image
