jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

The visualization result is right, but the extraction result is wrong! #414

Closed OK-JH closed 2 years ago

OK-JH commented 3 years ago

Describe the bug

The visualization result is right, but the extraction result is wrong, even though both use the same settings.

Code to reproduce the problem

The main code is as follows:

    setting = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "explicit_horizontal_lines": page.curves,
        "explicit_vertical_lines": page.edges + page.curves,
        "snap_tolerance": 5,
        # "text_tolerance": 0.5,
        # "intersection_tolerance": 5,
        # "intersection_x_tolerance": 3,
        # "intersection_y_tolerance": 3,
    }

    im = page.to_image(resolution=150)
    im.reset().debug_tablefinder(setting)
    im.save("test.png", format="PNG")

PDF file

test.pdf

Screenshots

The visualization is as follows: image

The extraction result is as follows: image

As you can see, the numbers in the yellow box end up in the cell to their left, and I do not know why. I guess it is a bug. Looking forward to your reply!

Environment

samkit-jain commented 3 years ago

Hi @OK-JH, I appreciate your interest in the library. If you draw the characters (im.draw_rects(page.chars)), you'll notice that those numbers are actually part of the second column, not the third, which is why you are seeing this unexpected behaviour. image

As to why this is happening, I don't have an answer for that at the moment.
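A minimal sketch of why the debug image and the extracted table can disagree: the debug image shows the cell boundaries, while extraction places each character according to its own recorded coordinates. The helper below is hypothetical (it is not part of pdfplumber) and assumes a character belongs to the column that contains its horizontal midpoint. The `x0`, `x1`, and `text` keys mirror pdfplumber's char dictionaries; the coordinates are made up for illustration.

```python
from bisect import bisect_right


def column_index(char, column_edges):
    """Return the index of the column containing the char's horizontal midpoint.

    `column_edges` is a sorted list of x-coordinates; column i spans
    column_edges[i] <= x < column_edges[i + 1]. Returns None if the
    midpoint falls outside every column.
    """
    mid = (char["x0"] + char["x1"]) / 2
    i = bisect_right(column_edges, mid) - 1
    if i < 0 or i >= len(column_edges) - 1:
        return None
    return i


# Made-up column boundaries and characters, for illustration only.
edges = [0, 100, 200, 300]
chars = [
    {"text": "A", "x0": 110, "x1": 118},  # recorded inside the second column
    {"text": "1", "x0": 190, "x1": 198},  # expected in column three, but its box lies in column two
    {"text": "B", "x0": 210, "x1": 218},  # recorded inside the third column
]

assignments = [(c["text"], column_index(c, edges)) for c in chars]
```

With a real page, the same check could be run against `page.chars` after drawing them with `im.draw_rects(page.chars)` to see which characters sit outside the column they appear to belong to.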

jsvine commented 2 years ago

Thanks, @samkit-jain. Looks like the odd placement of those characters might be due to a font issue. Repairing the PDF with Ghostscript seems to fix that, relocating those numbers to their correct positions:

Screen Shot 2022-07-20 at 2 36 52 PM

... though creating some strange encoding issues in the table extraction:

[['项    目', None, '周期', '\x80术要求', '说明'],
 ['巡检 \n项目', '外Ê检查', '1o或必要时', '无异常', '参É6.1.2 aĀ'],
 [None, '油温和绕组温þ检查', '1o或必要时', '符合设备\x80术文件要求', '参É6.1.2 bĀ'],
 [None, '呼吸器检查', '1o或必要时', '~燥剂总量的1/3~~燥状态', '参É6.1.2 cĀ'],
 [None, '冷却系统检查', '1o或必要时', '无异常', '参É6.1.2 dĀ'],
 [None, 'p载V接开关检查', '1o或必要时', '无异常', '参É6.1.2 eĀ'],
 [None, '声响及振动检查', '1o或必要时', '无异常', '参É6.1.2 fĀ'],
 ['红外热像检测', None, '1o或必要时', '1o或必要时', '参É6.1.3'],
 ['本体油\x7f色谱V析',
  None,
  '3o或必要时',
  '乙炔f1ÿ\uf06dL/LĀÿ注意值Ā \n氢气f150ÿ\uf06dL/LĀÿ注意值Ā \n总烃f150ÿ\uf06dL/LĀÿ注意值Ā',
  '参É6.1.4'],
 ['直流偏磁电流检查', None, '必要时', '', '参É6.1.5']]

Given that this seems to be an issue with the PDF itself (or possibly the font-handling in pdfminer.six), rather than pdfplumber, I'm closing this issue. Feel free to continue the discussion, however.
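The thread does not spell out the Ghostscript repair mentioned above. One common approach, sketched here as an assumption rather than the exact command that was used, is to rewrite the file through Ghostscript's `pdfwrite` device. The function names and file names are hypothetical, and the `gs` executable must be installed and on the PATH.

```python
import subprocess


def build_gs_repair_command(src, dst, gs_executable="gs"):
    """Build a Ghostscript command that rewrites `src` to `dst` via pdfwrite."""
    return [
        gs_executable,
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # re-emit the document as a new PDF
        src,
    ]


def repair_pdf(src, dst, gs_executable="gs"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_repair_command(src, dst, gs_executable), check=True)
    return dst


# Assumption: the file names are placeholders.
command = build_gs_repair_command("test.pdf", "test_repaired.pdf")
```

The rewritten file can then be opened with pdfplumber as usual. As the extraction output above shows, the rewrite may alter font encodings, so the extracted text is worth checking afterwards.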