Closed OK-JH closed 2 years ago
Hi @OK-JH, I appreciate your interest in the library. If you draw the characters (im.draw_rects(page.chars)), you'll notice that those numbers are actually part of the second column, not the third, which is why you are seeing this unexpected behaviour.
As to why this is happening, I don't have an answer for that at the moment.
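To make that check concrete without eyeballing the image, here is a minimal sketch that groups characters by column using their horizontal midpoints. It assumes a cell receives a character when the character's midpoint falls inside the cell, which is my understanding of how the extraction works but is not confirmed in this thread. The helper names (column_index, group_by_column) and the sample coordinates are hypothetical; the dict keys "text", "x0" and "x1" match what page.chars provides.

```python
def column_index(char, column_edges):
    """Return the index of the column whose [left, right) span contains
    the horizontal midpoint of `char`, or None if it lies outside them all."""
    mid = (char["x0"] + char["x1"]) / 2
    for i in range(len(column_edges) - 1):
        if column_edges[i] <= mid < column_edges[i + 1]:
            return i
    return None


def group_by_column(chars, column_edges):
    """Concatenate character text per column, in the order given."""
    columns = {}
    for ch in chars:
        columns.setdefault(column_index(ch, column_edges), []).append(ch["text"])
    return {idx: "".join(texts) for idx, texts in columns.items()}


# Hypothetical data: three columns with boundaries at x = 0, 100, 200, 300.
edges = [0, 100, 200, 300]
chars = [
    {"text": "A", "x0": 110, "x1": 120},
    {"text": "1", "x0": 185, "x1": 195},  # reported left of the x=200 boundary
    {"text": "B", "x0": 210, "x1": 220},
]
result = group_by_column(chars, edges)
```

With real data you would pass page.chars and the x-positions of the detected vertical lines instead of the hypothetical values. A digit that looks like it belongs to the third column but is reported with coordinates left of the boundary ends up grouped with the second.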
Thanks, @samkit-jain. It looks like the odd placement of those characters might be due to a font issue. Repairing the PDF with Ghostscript seems to fix that, relocating those numbers to their correct positions:
... though creating some strange encoding issues in the table extraction:
[['项 目', None, '周期', '\x80术要求', '说明'],
['巡检 \n项目', '外Ê检查', '1o或必要时', '无异常', '参É6.1.2 aĀ'],
[None, '油温和绕组温þ检查', '1o或必要时', '符合设备\x80术文件要求', '参É6.1.2 bĀ'],
[None, '呼吸器检查', '1o或必要时', '~燥剂总量的1/3~~燥状态', '参É6.1.2 cĀ'],
[None, '冷却系统检查', '1o或必要时', '无异常', '参É6.1.2 dĀ'],
[None, 'p载V接开关检查', '1o或必要时', '无异常', '参É6.1.2 eĀ'],
[None, '声响及振动检查', '1o或必要时', '无异常', '参É6.1.2 fĀ'],
['红外热像检测', None, '1o或必要时', '1o或必要时', '参É6.1.3'],
['本体油\x7f色谱V析',
None,
'3o或必要时',
'乙炔f1ÿ\uf06dL/LĀÿ注意值Ā \n氢气f150ÿ\uf06dL/LĀÿ注意值Ā \n总烃f150ÿ\uf06dL/LĀÿ注意值Ā',
'参É6.1.4'],
['直流偏磁电流检查', None, '必要时', '', '参É6.1.5']]
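One rough way to locate some of the damaged cells in output like the above is to flag control and private-use code points (for example '\x80', '\x7f' and '\uf06d'). This is only a sketch: the function names are hypothetical, and it will not catch cases where a CJK glyph was replaced by an ordinary Latin letter such as 'Ê' or 'Ā', since those are valid letters.

```python
import unicodedata

# Unicode general categories that rarely belong in extracted table text:
# Cc = control, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}
ALLOWED_WHITESPACE = {"\n", "\r", "\t"}


def suspicious_chars(cell):
    """Return the characters in `cell` that look like encoding debris."""
    if not cell:  # handles None and empty strings
        return []
    return [
        ch
        for ch in cell
        if ch not in ALLOWED_WHITESPACE
        and unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    ]


def flag_cells(table):
    """Return (row, column, chars) for every cell containing debris."""
    hits = []
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            bad = suspicious_chars(cell)
            if bad:
                hits.append((r, c, bad))
    return hits


# First two rows of the extraction result shown above.
sample = [
    ['项 目', None, '周期', '\x80术要求', '说明'],
    ['巡检 \n项目', '外Ê检查', '1o或必要时', '无异常', '参É6.1.2 aĀ'],
]
hits = flag_cells(sample)
```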
Given that this seems to be an issue with the PDF itself (or possibly the font handling in pdfminer.six), rather than pdfplumber, I'm closing this issue. Feel free to continue the discussion, however.
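For anyone who wants to try the Ghostscript repair mentioned above, the sketch below rewrites a PDF through Ghostscript's pdfwrite device. The exact command used in this thread was not given, so this invocation is an assumption; the function names and the file names are hypothetical, and the gs executable is assumed to be on the PATH.

```python
import shutil
import subprocess


def ghostscript_rewrite_command(src, dst, gs_binary="gs"):
    """Build the argument list for rewriting `src` to `dst` with pdfwrite.

    `-o` sets the output file (and implies batch, no-pause operation);
    `-sDEVICE=pdfwrite` selects Ghostscript's PDF output device.
    """
    return [gs_binary, "-o", dst, "-sDEVICE=pdfwrite", src]


def rewrite_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript; raises if it is missing or exits with an error."""
    if shutil.which(gs_binary) is None:
        raise RuntimeError(f"{gs_binary!r} not found on PATH")
    subprocess.run(ghostscript_rewrite_command(src, dst, gs_binary), check=True)


command = ghostscript_rewrite_command("test.pdf", "repaired.pdf")
# To actually run it: rewrite_pdf("test.pdf", "repaired.pdf")
```

As the output above shows, a rewritten file may fix character positions while changing how the text is encoded, so it is worth comparing the extracted text before and after.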
Describe the bug
The visualization result is correct, but the extraction result is wrong, even though both use the same settings.
Code to reproduce the problem
The main code is as follows:
setting = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_horizontal_lines": page.curves,
    "explicit_vertical_lines": page.edges + page.curves,
    "snap_tolerance": 5,
}
# Render the page and overlay the tables found with these settings
im = page.to_image(resolution=150)
im.reset().debug_tablefinder(setting)
im.save('test.png', format='PNG')
PDF file
test.pdf
Screenshots
The visualization is shown below:
The extraction result is shown below:
As you can see, the numbers in the yellow box are assigned to the cell on their left, and I do not know why. I guess it is a bug. Looking forward to your reply!
Environment