The text is judged to be tabulated

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


Code to reproduce the problem

import time

import pdfplumber


def get_table_whole_doc(pdf_file):
    table_result_list = []
    print("start get tables")
    start = time.time()
    pdf = pdfplumber.open(pdf_file)
    print("Elapsed: {:.2f}s".format(time.time() - start))
    # Iterate over every page of the PDF
    for page in pdf.pages:
        tables = page.extract_tables()
        table_result_list.append(tables)
    print("Elapsed: {:.2f}s".format(time.time() - start))
    return table_result_list

Hi @Veunsia, and thank you for your interest in this library. I appreciate you sharing the PDF. What page is this? I tried to find it in the PDF, but the PDF is 621 pages long and I could not find it. In any case, it's likely that the page has invisible lines or rectangles in it. You can test that this way:

im = page.to_image()
im.debug_tablefinder()

... or:

im = page.to_image()
im.draw_lines(page.lines, stroke="red").draw_rects(page.rects, stroke="blue")

If that does appear to be the case, you can try using page.filter(...) to exclude lines/rects with certain characteristics.
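As a rough sketch of that filtering step: the helper names below (summarize_rules, make_keep_fn, extract_tables_without_thin_rules) are hypothetical, and the linewidth threshold is only an illustrative assumption. Which attribute actually distinguishes the unwanted lines/rects depends on the PDF, so it is worth inspecting page.lines and page.rects first. The only pdfplumber calls assumed here are page.filter(test_function) and page.extract_tables().

```python
from collections import Counter


def summarize_rules(objs):
    """Count lines/rects by (object_type, linewidth, stroking_color).

    Intended for inspection, e.g. summarize_rules(page.lines + page.rects),
    to see which attribute values the suspicious objects share.
    """
    return Counter(
        (o.get("object_type"), o.get("linewidth"), repr(o.get("stroking_color")))
        for o in objs
    )


def make_keep_fn(min_linewidth):
    """Build a test function for page.filter(...).

    Assumption (illustrative only): unwanted lines/rects are the ones whose
    linewidth is below min_linewidth. All other objects are kept.
    """
    def keep(obj):
        if obj.get("object_type") in ("line", "rect"):
            linewidth = obj.get("linewidth")
            if linewidth is not None and linewidth < min_linewidth:
                return False
        return True
    return keep


def extract_tables_without_thin_rules(page, min_linewidth=0.1):
    """Filter the page, then run table extraction on the filtered page."""
    filtered = page.filter(make_keep_fn(min_linewidth))
    return filtered.extract_tables()
```

With a real page this would be called as extract_tables_without_thin_rules(page), after checking the output of summarize_rules(page.lines + page.rects) to pick a sensible criterion.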

Given that this is unlikely to be a bug, and is rather a request for troubleshooting of a specific PDF, I'm closing this issue. But feel free to continue the discussion here, or in the Discussion section for specific-PDF troubleshooting: https://github.com/jsvine/pdfplumber/discussions/categories/get-help-with-specific-pdfs

The text is judged to be tabulated #569
