py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Inconsistent hyphenation (and lost blanks) #2262

Open BrainAnnex opened 10 months ago

BrainAnnex commented 10 months ago

I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.

Some PDF files are parsed fine, but others (such as the attached "Ocean Currents.pdf") are disasters! Here's an example of the parsed text from the second page of the document:

the current flows in the op-\nposite direction to the surface current. This shift of currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity with depth, is called the Ekman spiral .\nThevelocityofthesurfacecurrentisthesumoftheve-\nlocitiesoftheEkman,geostrophic,tidal,andothercurrents.The Ekman surface current or wind drift current dependsuponthespeedofthewind,itsconstancy,thelengthoftimeit has blown, and other factors. In general, however, winddriftcurrentisabout2percentofthewindspeed,oralittleless,indeepwaterwherethewindhasbeenblowingsteadi-ly for at least 12 hours.\n3203. Currents Related To Density Differences\nThe density of water varies with salinity, temperature,\nand pressure. At any given depth, the differences in densityaredueonlytodifferencesintemperatureandsalinity.With\nsufficientdata,mapsshowinggeographicaldensitydistribu-tion at a certain depth can be drawn, with lines connectingpoints of equal density. These lines would be similar to iso-bars on a weather map

Notice 2 problems:

1) many words are run together, with the blank spaces lost; example:

currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity

2) hyphenation is rendered inconsistently. For example (see screenshot below):

op-
posite

is extracted as op-\nposite (with a newline), while:

iso-
bars

is extracted as iso-bars (no newline!)
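
As a post-processing workaround for the first case, a hyphen followed by a newline can be rejoined with a regular expression. This is only a minimal sketch and not part of pypdf; `join_linebreak_hyphens` is a hypothetical helper name. It deliberately leaves `iso-bars` alone, because without the newline that case looks the same as a genuinely hyphenated compound.

```python
import re

# Hypothetical helper (not a pypdf API): rejoin words split by a hyphen
# that is immediately followed by a newline, e.g. "op-\nposite".
_LINEBREAK_HYPHEN = re.compile(r"(?<=[a-z])-\n(?=[a-z])")


def join_linebreak_hyphens(text: str) -> str:
    """Remove a hyphen plus newline when it sits between two lowercase letters."""
    return _LINEBREAK_HYPHEN.sub("", text)


sample = "flows in the op-\nposite direction ... similar to iso-bars on a weather map"
print(join_linebreak_hyphens(sample))
```

Note that this would also merge real compounds that happen to break at their hyphen (for example `well-\nknown`), so a dictionary check may be needed on top of it.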


Code + PDF

Ocean Currents.pdf
(full document attached; please add to your tests)

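# pip install pypdf   (using version 3.16.4)
from pypdf import PdfReader

pdf_name = "Ocean Currents.pdf"
reader = PdfReader(pdf_name)
p = reader.pages[1]  # second page of the document (zero-based index)
print(repr(p.extract_text()))  # repr() shows newlines as \n, as in the sample above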

Thoughts

I suspect you'll say that the attached PDF is malformed. Maybe it is... but another library, PyMuPDF, parses it just fine.

In fact, I have noticed that the lost spaces are far more prevalent in extractions by pypdf, compared to PyMuPDF - BUT for some files it's the opposite, and pypdf does far better.

Empirically, I've noticed an intriguing complementarity between pypdf and PyMuPDF: for files where one messes up badly, the other one does well, and vice versa. Maybe they use different thresholds for detecting blank spaces in sentences?
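
One way to exploit that complementarity would be to run both extractors and keep whichever output has fewer implausibly long "words". The sketch below is an assumption-laden illustration: `long_token_ratio` and `pick_better` are hypothetical helpers, the 20-character cutoff is arbitrary, and candidate `"b"` is a hand-spaced version of the sample text, not real output from either library.

```python
def long_token_ratio(text: str, max_len: int = 20) -> float:
    """Fraction of whitespace-separated tokens longer than max_len characters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(t) > max_len for t in tokens) / len(tokens)


def pick_better(candidates: dict[str, str], max_len: int = 20) -> str:
    """Return the key of the candidate text with the lowest long-token ratio."""
    return min(candidates, key=lambda name: long_token_ratio(candidates[name], max_len))


candidates = {
    # Taken from the pypdf output quoted above.
    "a": "This shift of currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity with depth",
    # Hand-spaced for illustration only (NOT actual extractor output).
    "b": "This shift of current directions with depth, combined with the decrease in velocity with depth",
}
print(pick_better(candidates))
```

In practice each candidate string would come from one extractor's output for the same page.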

But the inconsistent hyphenation I mentioned at the beginning is another issue that seriously gets in the way of word extraction...

Thanks!

stefan6419846 commented 10 months ago

Please see the corresponding docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard

To summarize: text extraction is hard and involves a fair amount of guessing. By default you only have individual character positions; all remaining steps tend to use heuristics to form words etc., so they are not always correct. (Speaking of (Py)MuPDF: they provide commercial solutions as well and thus might have better general results.)
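
To illustrate the kind of guessing involved, here is a toy sketch of forming words from character positions with a gap threshold. It is not pypdf's actual algorithm; `Glyph`, `glyphs_to_text`, and the `space_ratio` parameter are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def glyphs_to_text(glyphs: list[Glyph], font_size: float, space_ratio: float = 0.25) -> str:
    """Insert a space when the gap between glyphs exceeds space_ratio * font_size."""
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > space_ratio * font_size:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


# Tightly set text: the word gap is only 2 units at font size 10.
glyphs = [Glyph("t", 0, 3), Glyph("o", 3, 5), Glyph("b", 10, 5), Glyph("e", 15, 5)]
print(glyphs_to_text(glyphs, font_size=10))                    # gap 2 is below 2.5
print(glyphs_to_text(glyphs, font_size=10, space_ratio=0.15))  # gap 2 is above 1.5
```

With the default ratio the two words are merged, while a lower ratio separates them; a ratio that is too low would in turn split words whose letters are loosely spaced, which is why no single value works for every document.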

MartinThoma commented 10 months ago

Thanks for sharing the file and some examples! This helps a lot to refine our heuristics.

I agree with everything @stefan6419846 said. There is little hope of ever solving this completely for all PDF documents.

Do you own the license of that file or is it public domain? I'm always interested in refining my benchmark for text extraction

BrainAnnex commented 10 months ago

It'd be nice to have a user-settable threshold, for situations (not super-common, but not exactly rare, either - in my tests) when the words are not spaced enough for the algorithm to make the right choice.

Does such a setting exist?

BrainAnnex commented 10 months ago

Also, the inconsistent hyphenation (sometimes leading to extracted text with a newline and sometimes without) is a separate issue altogether. Maybe I ought to have started 2 separate discussion threads...

BrainAnnex commented 10 months ago

> Do you own the license of that file or is it public domain? I'm always interested in refining my benchmark for text extraction

@MartinThoma - it's the PDF version of a book I used to own. I don't know if it's public domain. Doesn't a page extracted for technical tests qualify for "fair use"?

MartinThoma commented 10 months ago

> it's the PDF version of a book I used to own. I don't know if it's public domain

In that case I would advise against sharing it publicly. Private sharing might be OK, but I'm not a lawyer and I don't want to get into / cause issues :sweat_smile: