pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Out-of-order coordinates #948

Open zhangtingyun opened 4 months ago

zhangtingyun commented 4 months ago

I simply called pdfminer's method to parse a PDF, but the coordinates in the parsed result differ from what I expected: sometimes they are too high, sometimes too low. PDF.js handles the same file correctly.

[image: region_1_0]

I've made some modifications that fix this:

    Tm_mul_CTM = matrix
    Th = scaling
    Tfs = fontsize
    _render_matrix = (Tfs * Th, 0,  # 0
                      0, Tfs,       # 0
                      0, rise       # 1
                      )
    Trm = mult_matrix(_render_matrix, Tm_mul_CTM)
    (a, b, c, d, e, f) = Trm
    w, h = x1 - x0, y1 - y0
    (x0, y0) = (e, f)
    (x1, y1) = (x0 + w, y0 + h)
    y0, y1 = y0 + descent, y1 + descent

[image]
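For anyone who wants to try the modification above outside pdfminer, here is a minimal self-contained sketch. It assumes matrices are 6-tuples (a, b, c, d, e, f) multiplied in the same order as pdfminer.utils.mult_matrix, and that scaling is a fraction (1.0 means 100%). The function names text_rendering_matrix and reposition_bbox are hypothetical and exist only for this illustration; this is not the library's code and not a confirmed fix.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
BBox = Tuple[float, float, float, float]


def mult_matrix(m1: Matrix, m0: Matrix) -> Matrix:
    """Return m1 x m0 for affine matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def text_rendering_matrix(
    fontsize: float, scaling: float, rise: float, tm_mul_ctm: Matrix
) -> Matrix:
    """Build Trm = [Tfs*Th 0; 0 Tfs; 0 Trise] x (Tm x CTM)."""
    render = (fontsize * scaling, 0.0, 0.0, fontsize, 0.0, rise)
    return mult_matrix(render, tm_mul_ctm)


def reposition_bbox(bbox: BBox, trm: Matrix, descent: float) -> BBox:
    """Move a glyph box to the Trm origin, keeping its size (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0
    # The translation part (e, f) of Trm becomes the new lower-left corner.
    *_, e, f = trm
    x0, y0 = e, f
    x1, y1 = x0 + w, y0 + h
    # Shift vertically by the descent, as in the snippet above.
    return (x0, y0 + descent, x1, y1 + descent)


if __name__ == "__main__":
    trm = text_rendering_matrix(12.0, 1.0, 2.0, (1.0, 0.0, 0.0, 1.0, 100.0, 700.0))
    print(trm)
    print(reposition_bbox((0.0, 0.0, 6.0, 12.0), trm, -2.5))
```

The input values in the usage lines are made up; with a real PDF they would come from the text state and the current transformation matrix.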

zhangtingyun commented 4 months ago

I don't know whether my change is correct. Please let me know, or could you fix this bug? Thanks!

zhangtingyun commented 4 months ago

page_1.pdf (this is the PDF)

zhangtingyun commented 4 months ago

a227a8e0-d20d-11ee-9186-1f397a94c388

Han860207 commented 3 months ago

Hi, I'm facing the same issue while using pdfplumber, which is built on pdfminer.six. I'm using pdfminer.six 20221105 and pdfplumber 0.10.4. I also tried repairing the PDF with Ghostscript, as follows:

gswin64c -o repaired.pdf -sDEVICE=pdfwrite input.pdf 

The output file is repaired.pdf. Reference: https://github.com/jsvine/pdfplumber/issues/425

Text extracted from repaired.pdf is still out of order.

[image]

https://github.com/pdfminer/pdfminer.six/blob/ebf7bcdb983f36d0ff5b40e4f23b52525cb28f18/pdfminer/layout.py#L375

I then tried removing the "* fontsize" factor from the descent calculation at the line linked above:

[image]

With that change, the result is correct:

[image]
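To make the effect of that change concrete, here is a small arithmetic sketch. It assumes the vertical extent of a horizontal glyph box is formed as (descent + rise, descent + rise + fontsize), which is my reading of the linked line and not something verified here. The function vertical_bounds and the numbers are hypothetical. It only shows how far the box moves when the descent is or is not multiplied by the font size; it does not say which variant is right.

```python
from typing import Tuple


def vertical_bounds(
    descent: float, fontsize: float, rise: float, scale_descent: bool
) -> Tuple[float, float]:
    """Return (lower, upper) of a glyph box in text space (hypothetical helper)."""
    # scale_descent=True mimics "descent * fontsize";
    # False mimics the variant with "* fontsize" removed.
    d = descent * fontsize if scale_descent else descent
    lower = d + rise
    upper = d + rise + fontsize
    return lower, upper


if __name__ == "__main__":
    # Made-up values: descent of -0.25, font size 8, no text rise.
    print(vertical_bounds(-0.25, 8.0, 0.0, scale_descent=True))
    print(vertical_bounds(-0.25, 8.0, 0.0, scale_descent=False))
```

With these made-up inputs the two variants differ by 1.75 units vertically, and the gap grows with the font size. That would be consistent with boxes landing too high or too low depending on where the font size is applied a second time.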

Is this a bug or something? Thanks