Open pubpub-zz opened 2 years ago
This is not a proposed fix, but I hope it helps.
The following condition when operator is Td seems to be wrong.
In the sample file-0.pdf, the distance between the position of the previous character and that of the current character (delta_x) is greater than the space width, but the comparison does not behave consistently, because a character with a different width is passed in on each call.
In this case, it seems the comparison should check whether the incoming delta_x is greater than the previous character width plus the space width.
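A minimal sketch of the comparison described above. The function name and parameters are hypothetical and are not taken from the PyPDF2 source; all widths are assumed to be in the same units as delta_x.

```python
def needs_space(delta_x, prev_char_width, space_width):
    """Return True when the gap is wider than the previous glyph plus one space.

    Hypothetical helper, not PyPDF2 code: all three arguments are assumed
    to be expressed in the same text-space units.
    """
    return abs(delta_x) > prev_char_width + space_width


# A move that only covers the previous glyph should not produce a space.
print(needs_space(6.0, 6.0, 1.5))
# A move that covers the glyph plus more than one space width should.
print(needs_space(9.0, 6.0, 1.5))
```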
The calculated value itself also seemed wrong in this case.
As an example, there was a case where spacewidth * f * 15 was about 25, which is an impossible value given the size of delta_x.
For example, delta_x for a lowercase o was 6:
delta_x = 6
spacewidth = 0.125
f = 11.9989952004
font_size = 11.999
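As a quick check of the arithmetic, and assuming the values listed above all come from the same call, plugging them into spacewidth * f * 15 gives roughly 22.5, which is far larger than the delta_x of 6, so the condition would never be true for this character:

```python
# Values reported in this issue; assumed to belong to the same call.
delta_x = 6
spacewidth = 0.125
f = 11.9989952004

# Threshold as described in the text: spacewidth * f * 15
threshold = spacewidth * f * 15

print(round(threshold, 2))       # threshold used in the comparison
print(abs(delta_x) > threshold)  # whether a space would be detected
```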
I do not know where to get the correct data from.
If the number 15 is a unit conversion for spacewidth, then the intended expression here might be something like the following:
and abs(delta_x) > (spacewidth * k * 15) + f
However, this expression works for a monospaced font but not for a proportional font. It looks like I need to get the font width for each character. Is that possible?
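One possible source for per-character widths, offered only as a sketch: for simple (non-composite) fonts, the PDF font dictionary holds a /Widths array indexed from /FirstChar, with widths given in thousandths of a text-space unit. The helper below is hypothetical and takes a plain dict standing in for the font dictionary; it is not part of PyPDF2 and does not handle composite (Type0) fonts or missing-width defaults from the font descriptor.

```python
def glyph_width(font, char_code, font_size, default=0.0):
    """Look up the width of one character code, scaled to the font size.

    Hypothetical helper: `font` is a dict-like object with the standard
    /FirstChar and /Widths entries of a simple PDF font.
    """
    first_char = font.get("/FirstChar", 0)
    widths = font.get("/Widths", [])
    index = char_code - first_char
    if 0 <= index < len(widths):
        # /Widths values are in 1/1000 of a text-space unit.
        return widths[index] / 1000.0 * font_size
    return default


# Example with made-up widths for character codes 32 and 33.
example_font = {"/FirstChar": 32, "/Widths": [250, 333]}
print(glyph_width(example_font, 32, 12.0))
```

With a lookup like this, the previous character's width could be added to the space width on each call instead of relying on a single fixed factor.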
Issue with text extraction (spacing)
Environment
Which environment were you using when you encountered the problem? Windows 10
Code + PDF
file-0.pdf
result from text extraction (beginning only)
APPROVEDShortlyaftertheGenevaBOFsession,thewww-vrmlmailinglistwascreatedtodiscuss\nthedevelopmentofaspecificationforthefirstversionofVRML.Theresponsetothelist
Other case (space disappearing???): import PyPDF2; PyPDF2.PdfFileReader(open('c:/2017.pdf', 'rb')).pages[0].extract_text()
2017年年度报告.pdf
Observed in the footer (2018 年04 月)