jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Plain text is detected as a table #569

Closed wenderWang closed 2 years ago

wenderWang commented 2 years ago

Describe the bug

Ordinary body text on a page is detected as a table by extract_tables().

Code to reproduce the problem

import time

import pdfplumber


def get_table_whole_doc(pdf_file):
    table_result_list = []
    print("start get tables")
    start = time.time()
    # Open with a context manager so the file is closed afterwards
    with pdfplumber.open(pdf_file) as pdf:
        print("Elapsed: {:.2f} s".format(time.time() - start))
        # Iterate over every page of the PDF
        for page in pdf.pages:
            tables = page.extract_tables()
            table_result_list.append(tables)
    print("Elapsed: {:.2f} s".format(time.time() - start))
    return table_result_list

PDF file

2017-12-19-603712.SH-天津七一二通信广播股份有限公司首次公开发行股票招股说明书(申报稿2017年12月11日报送).pdf

Screenshots

image

Environment

jsvine commented 2 years ago

Hi @Veunsia, and thank you for your interest in this library. I appreciate you sharing the PDF. What page is this? I tried to find it in the PDF, but the PDF is 621 pages long and I could not find it. In any case, it's likely that the page has invisible lines or rectangles in it. You can test that this way:

im = page.to_image()
im.debug_tablefinder()

... or:

im = page.to_image()
im.draw_lines(page.lines, stroke="red").draw_rects(page.rects, stroke="blue")

If that does appear to be the case, you can try using page.filter(...) to exclude lines/rects with certain characteristics.
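As a rough sketch of that filtering step: page.filter(...) takes a function that receives each object as a dict and returns True for the objects to keep. Which characteristics mark the unwanted lines in this particular PDF is not known here, so the example below simply assumes they are stroked in white. The helper names (is_white, keep_object, extract_tables_filtered) are made up for illustration, and the colour check only handles grayscale or RGB values in the 0-1 range.

```python
def is_white(color):
    """Return True if a colour value looks like pure white.

    Assumption: colours are a number or a 1- or 3-element sequence in the
    0-1 range (grayscale or RGB). Other colour spaces are not handled.
    """
    if color is None:
        return False
    if isinstance(color, (int, float)):
        return color == 1
    return len(color) in (1, 3) and all(c == 1 for c in color)


def keep_object(obj):
    """Predicate for page.filter: drop lines/rects stroked in white."""
    if obj.get("object_type") not in ("line", "rect"):
        return True  # chars, curves, images, etc. are left untouched
    return not is_white(obj.get("stroking_color"))


def extract_tables_filtered(pdf_path, page_index):
    """Extract tables from one page after removing white lines/rects."""
    import pdfplumber  # imported here so the predicate can be tested alone

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index].filter(keep_object)
        return page.extract_tables()
```

If the debug image shows that the stray lines differ in some other way (for example in position or size), the same pattern applies: only the test inside keep_object needs to change.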

Given that this is unlikely to be a bug, and is rather a request for troubleshooting of a specific PDF, I'm closing this issue. But feel free to continue the discussion here, or in the Discussion section for specific-PDF troubleshooting: https://github.com/jsvine/pdfplumber/discussions/categories/get-help-with-specific-pdfs