Closed changlongpan closed 3 years ago
Hi @changlongpan, and thanks for your interest in this library. To answer your general question: doctop measures the vertical distance of the top of the object from the top of the entire PDF, while y0 measures the vertical distance from the bottom of the object to the bottom of the page. In the PDF spec, the "origin" of the coordinate system is the bottom-left corner of the page; in pdfplumber, we've added top, doctop, and bottom to make it more compatible with how people (at least in English) typically read PDFs (i.e., top-left to bottom-right). The implementation in the code you have screenshotted is intentional.
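A minimal sketch of how the two coordinate systems relate, assuming the page's bounding box starts at the origin and that doctop is simply top plus the combined height of all preceding pages. The Box class, the to_top_based function, and the numbers are hypothetical and chosen for illustration only; this is not pdfplumber's own code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    y0: float  # bottom edge, measured upward from the bottom of the page
    y1: float  # top edge, measured upward from the bottom of the page


def to_top_based(box, page_height, initial_doctop=0.0):
    """Convert bottom-origin PDF coordinates to top-origin ones.

    initial_doctop is assumed to be the summed height of all earlier pages.
    """
    top = page_height - box.y1      # distance from top of page to top of object
    bottom = page_height - box.y0   # distance from top of page to bottom of object
    doctop = initial_doctop + top   # distance from top of document to top of object
    return {"top": top, "bottom": bottom, "doctop": doctop}


# Made-up values: a 792pt-tall page preceded by five pages of the same height.
coords = to_top_based(Box(y0=583.5, y1=600.68), page_height=792.0, initial_doctop=3960.0)
print({name: round(value, 2) for name, value in coords.items()})
```

Within a single page, doctop and top differ only by a constant, so grouping characters by one or the other yields the same lines.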
On the question of your PDF specifically: The issue seems to be that the Chinese characters (at least as extracted by pdfminer.six, the library we use to extract them) are positioned differently than the English/Latin characters. For instance, here are the key vertical positional variables for the "realmCode" row and the characters that they match:
doctop | top | bottom | # chars | text |
---|---|---|---|---|
3912.16 | 191.32 | 208.5 | 17 | realmCode1..1'CN' |
3918.12 | 197.28 | 205.53 | 8 | 地域代码代表中国 |
As you can see, there is a difference of about 5.96pts on the top and about 2.97pts on the bottom. You can see this visually by using pdfplumber's visual debugging tools:
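import pdfplumber
pdf = pdfplumber.open("new(1).pdf")  # the PDF attached to this issue
page = pdf.pages[5]
im = page.to_image(resolution=300)
im.draw_rects(page.chars)  # outline every character's bounding box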
The relevant subpart of that image (though unfortunately my computer is having trouble rendering the Chinese characters):
Changing the extraction to page.extract_tables({ "text_y_tolerance": 6 }) seems to resolve this problem for me, in the specific example highlighted above, outputting ['realmCode', '1..1', "地域代码,'CN'代表中国", ''] for the "realmCode" row. You note, however, that that approach still causes you some problems. What are they?
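To illustrate why a tolerance of 6 merges the two runs of characters, here is a simplified sketch of tolerance-based clustering applied to the top and bottom values from the table above. The cluster_values function is a hypothetical stand-in written for this example, not pdfplumber's actual cluster_objects, and the tolerance of 3 is assumed to be the default.

```python
def cluster_values(values, tolerance):
    """Group sorted values, starting a new group whenever the gap
    to the previous value exceeds the tolerance."""
    ordered = sorted(values)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)   # close enough: same line
        else:
            groups.append([value])     # too far: new line
    return groups


# Values for the "realmCode" row, taken from the table above.
tops = [191.32, 197.28]      # Latin characters, Chinese characters
bottoms = [205.53, 208.5]    # Chinese characters, Latin characters

by_top_tol3 = cluster_values(tops, 3)        # gap of about 5.96 exceeds 3
by_top_tol6 = cluster_values(tops, 6)        # gap of about 5.96 fits within 6
by_bottom_tol3 = cluster_values(bottoms, 3)  # gap of about 2.97 fits within 3

print(len(by_top_tol3), len(by_top_tol6), len(by_bottom_tol3))
```

Because differences in y0 equal differences in bottom, the last case may explain why clustering on y0 appeared to group this row correctly even at the smaller tolerance. That is an inference from these four numbers only, not a general property of the PDF.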
Given that this is not a bug in pdfplumber, I'm closing this issue, but you are welcome to continue the conversation here.
Describe the bug
The words in tables are extracted out of order.
Code to reproduce the problem
PDF file
new(1).pdf (it can be opened in Chrome)
Expected behavior
What did you expect the result should have been?
Actual behavior
The position of the words is out of order.
Additional context
I tried to fix this issue by debugging, and I traced the problem to def cluster_objects(objs, attr, tolerance). The attr argument is passed in from extract_text() in utils.py, and its value is "doctop". With this value, characters on the same line of my PDF's table are grouped into multiple lines. I tried setting y_tolerance to 6, but still had some problems.
So I changed attr from doctop to y0, and then the output was right.
I would like to know why attr uses doctop rather than y0.