Open zhangtingyun opened 8 months ago
I don't know if my change is correct. Please let me know, or could you fix this bug? Thanks!
page_1.pdf (this is the PDF)
Hi, I'm also facing the same issue while using pdfplumber, which is built on pdfminer.six.
I'm using pdfminer.six 20221105 and pdfplumber 0.10.4.
I also tried repairing the PDF with Ghostscript, as follows:
gswin64c -o repaired.pdf -sDEVICE=pdfwrite input.pdf
The output file is repaired.pdf. Reference: https://github.com/jsvine/pdfplumber/issues/425
Text extracted from repaired.pdf is still out of order.
I then tried removing "* fontsize" from the descent calculation, and with that change the result comes out correct.
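To make the effect of that change concrete, here is a minimal sketch of a horizontal glyph bounding box computed with and without multiplying descent by fontsize. It is not pdfminer.six's actual code: `apply_matrix_pt_sketch`, `char_bbox`, and the `scale_descent` flag are hypothetical names, and the sample numbers are made up. It assumes the box is built from descent, rise, and fontsize before the matrix is applied.

```python
def apply_matrix_pt_sketch(m, pt):
    """Apply a 6-element affine matrix (a, b, c, d, e, f) to a point."""
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + c * y + e, b * x + d * y + f)


def char_bbox(matrix, fontsize, adv, descent, rise=0.0, scale_descent=True):
    """Return (x0, y0, x1, y1) for a horizontal glyph.

    scale_descent=True multiplies descent by fontsize; False uses the
    descent value as given (the variant described in the comment above).
    """
    d = descent * fontsize if scale_descent else descent
    lower_left = (0.0, d + rise)
    upper_right = (adv, d + rise + fontsize)
    x0, y0 = apply_matrix_pt_sketch(matrix, lower_left)
    x1, y1 = apply_matrix_pt_sketch(matrix, upper_right)
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Hypothetical values: identity matrix, 12 pt font, descent of -0.2.
identity = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
scaled = char_bbox(identity, fontsize=12.0, adv=6.0, descent=-0.2)
unscaled = char_bbox(identity, fontsize=12.0, adv=6.0, descent=-0.2,
                     scale_descent=False)
print("with * fontsize:   ", scaled)
print("without * fontsize:", unscaled)
```

The two boxes have the same height and differ only by a vertical shift, which may be why lines end up grouped differently during layout analysis. Whether that shift is right for a given PDF depends on the units in which its font reports descent.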
Is this a bug or something? Thanks
https://github.com/pdfminer/pdfminer.six/issues/948#issuecomment-2006396235
Is it possible to use the method described above?
I followed the above method, but the problem is still not solved.
I simply called pdfminer to parse a PDF, but the coordinates in the parsed result are not what I expected: sometimes they are too high and sometimes too low. PDF.js handles this case correctly.

I've made some modifications that fix this:

    Tm_mul_CTM = matrix
    Th = scaling
    Tfs = fontsize
    _render_matrix = (Tfs * Th, 0,  # 0
                      0, Tfs,       # 0
                      0, rise       # 1
                      )
    Trm = mult_matrix(_render_matrix, Tm_mul_CTM)
    (a, b, c, d, e, f) = Trm
    w, h = x1 - x0, y1 - y0
    (x0, y0) = (e, f)
    (x1, y1) = (x0 + w, y0 + h)
    y0, y1 = y0 + descent, y1 + descent
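For anyone who wants to try the matrix step in isolation, here is a minimal standalone sketch of the text rendering matrix as the PDF specification defines it: Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM. The helper names `mult_matrix_sketch` and `text_rendering_matrix` are hypothetical and defined locally rather than imported from pdfminer.six, and the sample values are made up.

```python
def mult_matrix_sketch(m1, m0):
    """Compose two affine matrices (a, b, c, d, e, f): apply m1, then m0."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def text_rendering_matrix(tm_mul_ctm, fontsize, scaling=1.0, rise=0.0):
    """Build Trm from font size (Tfs), horizontal scaling (Th), and rise.

    tm_mul_ctm is assumed to be the text matrix already multiplied by
    the CTM, as in the snippet above.
    """
    render = (fontsize * scaling, 0.0, 0.0, fontsize, 0.0, rise)
    return mult_matrix_sketch(render, tm_mul_ctm)


# Hypothetical values: 12 pt font, no horizontal scaling, 2 pt rise,
# and a text matrix that only translates to (100, 700).
trm = text_rendering_matrix((1.0, 0.0, 0.0, 1.0, 100.0, 700.0),
                            fontsize=12.0, scaling=1.0, rise=2.0)
a, b, c, d, e, f = trm
print("glyph origin:", (e, f))
print("scale:", (a, d))
```

In this sketch `e` and `f` give the glyph origin in device space, which is what the modification above uses for (x0, y0) before shifting by descent.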