py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Issue in text extraction (spaces) #1153

Open pubpub-zz opened 2 years ago

pubpub-zz commented 2 years ago

Issue with text extraction (spacing)

Environment

Which environment were you using when you encountered the problem? Windows 10

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.7.0

Code + PDF

import PyPDF2;PyPDF2.PdfFileReader(open('c:/file-0.pdf', 'rb')).pages[3].extract_text()

file-0.pdf

result from text extraction (beginning only)

APPROVEDShortlyaftertheGenevaBOFsession,thewww-vrmlmailinglistwascreatedtodiscuss\nthedevelopmentofaspecificationforthefirstversionofVRML.Theresponsetothelist

Another case (space disappearing?):

import PyPDF2;PyPDF2.PdfFileReader(open('c:/2017.pdf', 'rb')).pages[0].extract_text()

2017年年度报告.pdf

Observed in the footer ( 2018 年04 月)

ssjkamei commented 6 days ago

This is not a proposed fix, but I hope it helps.

The following condition, applied when the operator is Td, seems to be wrong.

https://github.com/py-pdf/pypdf/blob/d974d5c755a7b65f3b9c68c5742afdbc0c1693f6/pypdf/_text_extraction/__init__.py#L136

In the sample from file-0.pdf, the horizontal distance between the position of the previous character and the current character (delta_x) is greater than the space size, but the comparison itself is not meaningful, because a character of a different width is passed in on each call. It seems a space should be inserted only when the incoming delta_x is greater than the previous character's width plus the space width.

The calculated value itself also seemed wrong in this case.

As an example, one case came in with the values below: spacewidth * f * 15 came to about 22.5, which is an impossible threshold given the size of delta_x. For a lowercase o, delta_x was 6.

delta_x = 6
spacewidth = 0.125
f = 11.9989952004
font_size = 11.999

I do not know where to get the correct data from.
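
To make the mismatch concrete, here is a minimal sketch that plugs the values above into the comparison `abs(delta_x) > spacewidth * f * 15`. It only reproduces the arithmetic; the variable names `threshold` and `inserts_space` are mine, and I am assuming this comparison is what decides whether a space is emitted.

```python
# Values reported above for one Td operation in file-0.pdf.
delta_x = 6
spacewidth = 0.125
f = 11.9989952004

# Threshold the current condition compares delta_x against.
threshold = spacewidth * f * 15

# Assumption: a space is emitted only when delta_x exceeds the threshold.
inserts_space = abs(delta_x) > threshold

print(f"threshold = {threshold:.2f}, delta_x = {delta_x}, space inserted: {inserts_space}")
```

With a threshold this far above delta_x, the condition never fires for these values, which would match the missing spaces in the extracted text.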

ssjkamei commented 5 days ago

If the number 15 is a unit conversion for spacewidth, then what we wanted to do here might be an expression like the following.

and abs(delta_x) > (spacewidth * k * 15) + f

However, this expression works for monospaced fonts but not for proportional fonts. It looks like I need the width of each individual character. Is that possible?
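
As a rough sketch of the idea for proportional fonts, the check could compare delta_x with the previous character's own width plus the space width. Everything here is hypothetical: `needs_space` is not a pypdf function, the widths are assumed to be supplied by the caller in the same units as delta_x, and the numbers in the usage lines are illustrative only.

```python
def needs_space(delta_x: float, prev_char_width: float, space_width: float) -> bool:
    """Return True when the gap exceeds the previous glyph plus one space.

    Hypothetical helper: all three values must be in the same units.
    """
    return abs(delta_x) > prev_char_width + space_width


# Illustrative numbers: a glyph 6 units wide and a space 3 units wide.
same_word = needs_space(delta_x=6, prev_char_width=6, space_width=3)
next_word = needs_space(delta_x=10, prev_char_width=6, space_width=3)
print(same_word, next_word)
```

The open question remains where the per-character width would come from during extraction.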