The position of the words in tables are out of order

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

Code to reproduce the problem

import pandas as pd
import pdfplumber

count = 1
with pdfplumber.open('./new(1).pdf', precision=2) as f:
    with pd.ExcelWriter('./out.xlsx') as w:
        # Create multiple worksheets, one per extracted table
        for index, page in enumerate(f.pages):
            for index1, table in enumerate(page.extract_tables()):
                # Strip line breaks from the header cells, skipping empty ones
                for i, _ in enumerate(table[0]):
                    if table[0][i] is None:
                        continue
                    table[0][i] = table[0][i].replace('\n', '')
                # Rejoin hyphenated words, then drop the remaining line breaks
                data = pd.DataFrame(table[1:], columns=table[0]).replace('-\n', '', regex=True)
                data = data.replace('\n', '', regex=True)
                data.to_excel(w, sheet_name='{}'.format(count), encoding="utf_8_sig", index=False)
                count += 1

Additional context

I tried to fix this issue myself. Using a debugger, I traced the problem to def cluster_objects(objs, attr, tolerance).

The attr argument comes from extract_text() in utils.py, and its value is "doctop".

With this value, a single line in my PDF's table is grouped into multiple lines. I tried setting y_tolerance to 6, but there are still some problems.

So I changed the attr from doctop to y0 and set

Then the output is correct.

I would like to know why the attr uses doctop rather than y0.

Hi @changlongpan, and thanks for your interest in this library. To answer your general question: doctop measures the vertical distance of the top of the object from the top of the entire PDF, while y0 measures the vertical distance from the bottom of the object to the bottom of the page. In the PDF spec, the "origin" of the coordinate system is the bottom-left corner of the page; in pdfplumber, we've added top, doctop, and bottom to make it more compatible with how people (at least in English) typically read PDFs (i.e., top-left to bottom-right). The implementation in the code you have screenshotted is intentional.
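To make the relationship between these attributes concrete, here is a minimal sketch of the conversion described above. It is not pdfplumber's actual code: the function name, the `initial_doctop` parameter name, and the 792 pt page height are assumptions made for illustration. The offset between doctop and top (3720.84) is derived from the "realmCode" row quoted later in this thread.

```python
def to_top_based(y0, y1, page_height, initial_doctop=0.0):
    """Convert bottom-origin PDF coordinates to top-origin ones.

    y0, y1: distance of the object's bottom and top edge from the page bottom.
    page_height: height of the page in points.
    initial_doctop: combined height of everything above this page
        (hypothetical parameter name for this sketch).
    """
    top = page_height - y1          # distance from page top to object top
    bottom = page_height - y0       # distance from page top to object bottom
    doctop = initial_doctop + top   # distance from the top of the whole document
    return {"top": top, "bottom": bottom, "doctop": doctop}


# Assumption: the page height is not stated in the thread; 792 pt is a placeholder.
PAGE_HEIGHT = 792.0
# Derived from the thread's numbers: doctop - top = 3912.16 - 191.32
INITIAL_DOCTOP = 3912.16 - 191.32

# Latin characters of the "realmCode" row, expressed in bottom-origin coordinates
latin = to_top_based(
    y0=PAGE_HEIGHT - 208.5,
    y1=PAGE_HEIGHT - 191.32,
    page_height=PAGE_HEIGHT,
    initial_doctop=INITIAL_DOCTOP,
)
print({name: round(value, 2) for name, value in latin.items()})
```

The point of the sketch is that top and doctop differ only by a per-page offset, while y0 describes a different edge of the object (its bottom), measured from the opposite side of the page.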

On the question of your PDF specifically: The issue seems to be that the Chinese characters (at least as extracted by pdfminer.six, the library we use to extract them) are positioned differently than the English/Latin characters. For instance, here are the key vertical positional variables for the "realmCode" row and the characters that they match:

doctop	top	bottom	# chars	text
3912.16	191.32	208.5	17	realmCode1..1'CN'
3918.12	197.28	205.53	8	地域代码代表中国

As you can see, there is a difference of about 5.96 pt on the top and about 2.97 pt on the bottom. You can see this visually by using pdfplumber's visual debugging tools:

page = pdf.pages[5]
im = page.to_image(resolution=300)
im.draw_rects(page.chars)

The relevant subpart of that image — though unfortunately my computer is having trouble rendering the Chinese characters:

Changing the extraction to page.extract_tables({ "text_y_tolerance": 6 }) seems to resolve this problem for me, in the specific example highlighted above, outputting ['realmCode', '1..1', "地域代码,'CN'代表中国", ''] for the "realmCode" row. You note, however, that that approach still causes you some problems. What are they?
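To illustrate why the tolerance matters here, below is a simplified sketch of tolerance-based clustering applied to the numbers from the table above. It is a stand-in for the idea behind cluster_objects, not the library's implementation; the function name `cluster_values` and the starting tolerance of 3 are assumptions for this example.

```python
def cluster_values(values, tolerance):
    """Group values into clusters of neighbours.

    After sorting, a value joins the current cluster when it lies within
    `tolerance` of the previous value; otherwise it starts a new cluster.
    """
    clusters = []
    for value in sorted(values):
        if clusters and value <= clusters[-1][-1] + tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


# Values taken from the "realmCode" row in this thread
doctops = [3912.16, 3918.12]   # Latin chars, Chinese chars (tops differ by 5.96)
bottoms = [208.5, 205.53]      # same two groups (bottoms differ by 2.97)

print(len(cluster_values(doctops, tolerance=3)))  # 2: the row splits into two lines
print(len(cluster_values(doctops, tolerance=6)))  # 1: the row stays together
print(len(cluster_values(bottoms, tolerance=3)))  # 1: bottoms are close enough
```

Because y0 is just the page height minus bottom, differences in y0 equal differences in bottom. That may explain why clustering on y0 appeared to work for this row: the bottoms of the two character groups are closer together than their tops.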

Given that this is not a bug in pdfplumber, I'm closing this issue, but you are welcome to continue the conversation here.

The position of the words in tables are out of order #498
