jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

The position of the words in tables are out of order #498

Closed changlongpan closed 3 years ago

changlongpan commented 3 years ago

Describe the bug

The position of the words in tables are out of order

Code to reproduce the problem

import pandas as pd
import pdfplumber

count=1
with pdfplumber.open('./new(1).pdf',precision=2) as f:
    with pd.ExcelWriter('./out.xlsx') as w:  # create multiple worksheets
        for index, page in enumerate(f.pages):
            for index1,table in enumerate(page.extract_tables()):
                for i, _ in enumerate(table[0]):
                    if table[0][i] is None:
                        continue
                    table[0][i] = table[0][i].replace('\n', '')
                data = pd.DataFrame(table[1:], columns=table[0]).replace('-\n', '', regex=True)
                data = data.replace('\n', '', regex=True)
                data.to_excel(w, sheet_name='{}'.format(count), encoding="utf_8_sig", index=False)
                count += 1

PDF file

new(1).pdf (it can be opened in Chrome)

Expected behavior

What did you expect the result should have been? image

Actual behavior

What actually happened, instead? The position of the words is out of order image

Environment

Additional context

I tried to fix this issue myself. Using a debugger, I found that the problem is in def cluster_objects(objs, attr, tolerance) (image). The attr argument comes from extract_text() in utils.py, and its value is "doctop" (image). With this value, characters on the same line of my PDF's table are grouped into multiple lines. I tried setting y_tolerance to 6, but some problems remain.

So I changed attr from doctop to y0 and set (image); after that, the output is correct (image).

I would like to know why attr uses doctop rather than y0.

jsvine commented 3 years ago

Hi @changlongpan, and thanks for your interest in this library. To answer your general question: doctop measures the vertical distance of the top of the object from the top of the entire PDF, while y0 measures the vertical distance from the bottom of the object to the bottom of the page. In the PDF spec, the "origin" of the coordinate system is the bottom-left corner of the page; in pdfplumber, we've added top, doctop, and bottom to make it more compatible with how people (at least in English) typically read PDFs (i.e., top-left to bottom-right). The implementation in the code you have screenshotted is intentional.
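To make the relationship between these attributes concrete, here is a minimal sketch in plain Python. It is not pdfplumber's own code: the function name to_top_based, the page height of 792, and the sample coordinates are all hypothetical, and it assumes every page's coordinate system starts at 0 with no crop offset.

```python
def to_top_based(y0, y1, page_height, preceding_heights):
    """Convert bottom-up PDF coordinates (y0, y1) to top-down ones.

    Hypothetical helper for illustration only; not part of pdfplumber.
    """
    top = page_height - y1      # page top down to the object's top edge
    bottom = page_height - y0   # page top down to the object's bottom edge
    # doctop also counts the heights of all pages before this one
    doctop = sum(preceding_heights) + top
    return {"top": top, "bottom": bottom, "doctop": doctop}


# An object on the third page of a document whose pages are all 792pt tall
coords = to_top_based(
    y0=100.0, y1=112.0, page_height=792.0, preceding_heights=[792.0, 792.0]
)
print(coords)
```

Under these assumptions, grouping by y0 and grouping by doctop both measure vertical position on the page. What differs is which edge of the character is compared (bottom versus top), and that matters when glyphs from different fonts have different heights.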

On the question of your PDF specifically: The issue seems to be that the Chinese characters — at least as extracted by pdfminer.six, the library we use to extract them — are positioned differently than the English/Latin characters. For instance here are the key vertical positional variables for the "realmCode" row and the characters that they match:

| doctop | top | bottom | # chars | text |
|---|---|---|---|---|
| 3912.16 | 191.32 | 208.5 | 17 | realmCode1..1'CN' |
| 3918.12 | 197.28 | 205.53 | 8 | 地域代码代表中国 |

As you can see, there is a difference of about 5.96pts at the top and about 2.97pts at the bottom. You can see this visually by using pdfplumber's visual debugging tools:

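import pdfplumber

pdf = pdfplumber.open("./new(1).pdf")  # the PDF attached to this issue
page = pdf.pages[5]
im = page.to_image(resolution=300)
im.draw_rects(page.chars)  # outline every character's bounding box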

The relevant subpart of that image — though unfortunately my computer is having trouble rendering the Chinese characters:

Screen Shot 2021-09-02 at 9 20 24 AM

Changing the extraction to page.extract_tables({ "text_y_tolerance": 6 }) seems to resolve this problem for me, in the specific example highlighted above, outputting ['realmCode', '1..1', "地域代码,'CN'代表中国", ''] for the "realmCode" row. You note, however, that that approach still causes you some problems. What are they?
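To illustrate why the tolerance matters here, the following is a simplified sketch of tolerance-based clustering applied to the two doctop values from the table above. It is a stand-in written for this explanation, not pdfplumber's actual cluster_objects implementation; the function name cluster_values and the smaller tolerance of 3 are assumptions chosen for illustration.

```python
def cluster_values(values, tolerance):
    """Group numbers so that each value sits within `tolerance`
    of the previous value in its group (simplified illustration)."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)   # close enough: same line
        else:
            clusters.append([v])     # gap too large: start a new line
    return clusters


# doctop of the Latin characters and of the Chinese characters in the row
doctops = [3912.16, 3918.12]

split = cluster_values(doctops, tolerance=3)   # gap of ~5.96 exceeds 3
merged = cluster_values(doctops, tolerance=6)  # gap of ~5.96 fits within 6

print(len(split), len(merged))
```

With the smaller tolerance the two sets of characters land in separate groups and are read as different lines; with a tolerance of 6 they fall into one group, which is consistent with the "realmCode" row coming out as a single line.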

Given that this is not a bug in pdfplumber, I'm closing this issue, but you are welcome to continue the conversation here.