jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Why is the order of extracting the contents in the table cells wrong? #1056

Closed xielulu1994 closed 6 months ago

xielulu1994 commented 7 months ago

Describe the bug

I use extract_table() to extract table content, and I find that the order of the extracted text within individual cells is inconsistent with the original text.

image

pdf table: image

Code to reproduce the problem

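from typing import List

import pdfplumber

# file_path is the path to the PDF being processed (defined elsewhere).
table_text_items: List[tuple] = []
with pdfplumber.open(file_path) as pdf:
    for page in pdf.pages:
        # extract_table() returns the page's largest table as a list of rows, or None.
        table = page.extract_table()
        lines: List[str] = []
        if table:
            for row in table:
                # Skip empty cells (None or ""); split multi-line cells into lines.
                for cell in row:
                    if cell:
                        lines.extend(cell.split("\n"))
        if lines:
            table_text_items.append((page.page_number, lines))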
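
The flattening step in the snippet above can be checked without any PDF. The following is a minimal sketch, not part of the original report: `flatten_table` is a hypothetical helper name, and the sample table is made up for illustration. It shows that the loop keeps each cell's text in exactly the order in which extract_table() returned it, so any reordering would already be present in the cell strings rather than introduced by the loop.

```python
from typing import List, Optional


def flatten_table(table: Optional[List[List[Optional[str]]]]) -> List[str]:
    """Flatten rows of cell strings into text lines, skipping empty cells."""
    lines: List[str] = []
    for row in table or []:
        for cell in row:
            # None and "" are both treated as empty cells.
            if cell:
                lines.extend(cell.split("\n"))
    return lines


# Made-up stand-in for the value returned by page.extract_table().
sample = [["Name", "Note"], ["A", "first line\nsecond line"], [None, ""]]
print(flatten_table(sample))
```

Printing the raw cell strings (for example with `repr`) before splitting them is one way to confirm whether the unexpected order is already present in the output of extract_table().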

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.