py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Exceptions / missing spaces in extract_text() method #17

Closed mstamy2 closed 2 years ago

mstamy2 commented 11 years ago

extractText() method isn't broken, but throws some exceptions in these cases:

http://doctor12wer.blogspot.com/2013/06/extracttext-function-in-pypdf2-throws.html

http://stackoverflow.com/questions/17270387/pypdf2-typeerror-when-trying-to-extract-text

tnorth commented 10 years ago

Hello,

Works for me, but the extracted text contains no spaces :/

from PyPDF2 import PdfFileReader
reader = PdfFileReader(open("foo.pdf", 'rb'))  # renamed from `input`, which shadows a builtin
print(reader.getPage(0).extractText())

Is that a known issue?

tnorth commented 10 years ago

Hmm, to make it clearer: the issue seems to appear for two-column papers, this one for example: www.rowland.harvard.edu/rjf/vollmer/images/vollmer_fischer.pdf

mstamy2 commented 10 years ago

The extractText method is probably a little crude, and definitely doesn't function well for PDFs with complicated text. It could use some work to return text in a more orderly fashion that more closely resembles the text you see in a PDF viewer.

alisufian commented 10 years ago

Another pdf where whitespace is not preserved in extracted text http://webapp.psc.state.md.us/Intranet/Casenum/NewIndex3_VOpenFile.cfm?ServerFilePath=C:\Casenum\9100-9199\9155\\354.pdf

kursataker commented 9 years ago

I tried to extract Arabic text from a PDF file using the extractText() method. However, the Arabic text disappears in the output.

Lerchensporn commented 8 years ago

To resolve the problem of missing whitespace, I propose the following for-loop in the extractText method. The part below “text += i” is new. The threshold “i < -100”, below which a spacing value is treated as whitespace, is chosen arbitrarily; in a typical Springer PDF book, values of -300 to -200 correspond to a whitespace. Although this may look like a hack, I can think of no other criterion for whitespace in such documents.

Edit: Furthermore, I suggest removing “text += "\n"” after the TJ operator, because it breaks words in some documents. Handling of the TD, Td, and Tm operators still needs refinement.

        for operands, operator in content.operations:
            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                    elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                        if i < -100:
                            text += " "
            elif operator == b_("TD") or operator == b_("Tm"):
                if len(text) > 0 and text[-1] != " " and text[-1] != "\n":
                    text += " "
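
The threshold heuristic can be tried in isolation, without PyPDF2. This is only a minimal sketch, assuming one TJ operand is modelled as a plain Python list of strings and numbers; `join_tj` and `space_threshold` are hypothetical names, and -100 is the arbitrary cutoff from the comment above.

```python
def join_tj(items, space_threshold=-100):
    """Join the items of one TJ array, turning large negative adjustments into spaces.

    `items` stands in for operands[0]: strings are text fragments and
    numbers are positioning adjustments (assumption: plain Python types).
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, (int, float)) and item < space_threshold:
            # A strongly negative adjustment widens the gap before the next
            # fragment, which usually marks a word break rather than kerning.
            parts.append(" ")
    return "".join(parts)


print(join_tj(["Hello", -250, "wor", -20, "ld"]))
```

Here the -250 adjustment becomes a space, while the -20 adjustment is treated as kerning and ignored, so the call prints "Hello world".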

mborus commented 7 years ago

@woho's idea worked for me. I got too many spaces, so I changed the code slightly...

                # add spaces
                # q&d - https://github.com/mstamy2/PyPDF2/issues/17
                elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                    if text and text[-1] not in " \n":
                        text += " "
        elif operator == b_("TD") or operator == b_("Tm"):
            if text and text[-1] not in " \n":
                text += " "
        # end add spaces
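
The "append a space only if the text does not already end in whitespace" check appears twice in this variant, so it could be factored out. A minimal sketch, assuming `text` is an ordinary Python string; `append_space_once` is a hypothetical helper name, not part of PyPDF2.

```python
def append_space_once(text):
    """Return text with one trailing space, unless it is empty or already ends in whitespace."""
    if text and text[-1] not in " \n":
        return text + " "
    return text


text = "foo"
text = append_space_once(text)  # adds one space
text = append_space_once(text)  # no change: already ends in a space
```

Calling it repeatedly never produces a run of spaces, which is what reduces the "too many spaces" effect described above.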

chrisjcameron commented 6 years ago

Some PDFs apparently generate empty operands. If this condition is explicitly checked, then I can avoid some thrown exceptions:

_text = operands[0] throws an exception if operands is empty.

Quick fix:

        for operands, operator in content.operations:
            if not operands:          # Empty operands list contributes no text
                operands = [""]
            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
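
One limitation of substituting `[""]`: the `"` branch in the loop further up reads `operands[2]`, which would still raise an IndexError on a one-element list. A more general guard might look like the sketch below, where `operand_at` is a hypothetical helper (not part of PyPDF2) and `index` is assumed to be non-negative.

```python
def operand_at(operands, index):
    """Return operands[index], or None when the operand list is too short."""
    return operands[index] if len(operands) > index else None


# An empty or short operand list yields None instead of raising IndexError.
first = operand_at([], 0)
third = operand_at([""], 2)
```

Because None is not a TextStringObject, the existing `isinstance(_text, TextStringObject)` checks would then skip such operands without further changes.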

Tom-Evers commented 6 years ago

There should be a newline somewhere:

        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
                elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                    if text and (not text[-1] in " \n"):
                        text += " "
            text += "\n"

Tom-Evers commented 6 years ago

It seems that the value of the Float/NumberObject directly encodes the distance between two pieces of text, with the width of one space equaling -600:

                if text and (not text[-1] in " \n"):
                    text += " " * int(i / -600)
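
The proportional rule can be checked on its own. A minimal sketch, assuming the adjustment is a plain number; `spaces_for_adjustment` and `space_width` are hypothetical names. The value 600 is the width observed in the comment above, not a constant: TJ adjustments are expressed in thousandths of a text space unit, so the width of a space likely varies with the font.

```python
def spaces_for_adjustment(adjustment, space_width=600):
    """Number of spaces implied by one TJ adjustment.

    Negative adjustments widen the gap; positive ones (tightening)
    and small negative ones (kerning) yield zero spaces.
    """
    return max(0, int(-adjustment / space_width))


for value in (-1200, -900, -100, 300):
    print(value, repr(" " * spaces_for_adjustment(value)))
```

With the default width, -1200 gives two spaces, -900 gives one, and both -100 and 300 give none.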

MartinThoma commented 2 years ago

A lot of the whitespace issues got fixed via https://github.com/py-pdf/PyPDF2/pull/569

MartinThoma commented 2 years ago

#924 improved further on the whitespace issue.

MartinThoma commented 2 years ago

I think it is fixed.

Minimal example

from PyPDF2 import PdfReader

reader = PdfReader("vollmer_fischer.pdf")  # www.rowland.harvard.edu/rjf/vollmer/images/vollmer_fischer.pdf
text = reader.pages[0].extract_text()

text is now:

Ring-resonator-based frequency-domain opticalactivity measurements of a chiral liquid
Frank Vollmer and Peer Fischer
The Rowland Institute at Harvard, Harvard University, Cambridge, Massachusetts 02142Received September 22, 2005; revised November 11, 2005; accepted November 12, 2005; posted November 16, 2005 (Doc. ID 64961)
Chiral liquids rotate the plane of polarization of linearly polarized light and are therefore optically active.Here we show that optical rotation can be observed in the frequency domain. A chiral liquid introduced in afiber-loop ring resonator that supports left and right circularly polarized modes gives rise to relative fre-quency shifts that are a direct measure of the liquid’s circular birefringence and hence of its optical activity.The effect is in principle not diminished if the circumference of the ring is reduced. The technique is simi-larly applicable to refractive index and linear birefringence measurements.
© 2006 Optical Society ofAmericaOCIS codes:260.1440, 120.5410
.Natural optical activity arises because a medium hasdifferent refractive indices for left (/H11002) and right (/H11001)circularly polarized light. The optical rotation, in ra-dians, developed over a path lengthlis a function ofthe wavelength/H9261and is given by
/H9258=/H9266l

/H9261/H20851n/H20849−/H20850−n/H20849+/H20850/H20852./H208491/H20850The circular birefringence,n
/H20849−/H20850−n/H20849+/H20850, is, however,even in a pure chiral liquid small and at most a fewparts in 10
6. It is thus desirable to increase the effec-tive path length through the optically active mediumwithout the need for large sample volumes. This canbe achieved in an optical cavity as long as one en-sures that the optical rotation does not cancel on theround trip, which in practice one can accomplish byplacing quarter-wave plates in the cavity.
1Signifi-cant enhancements in sensitivity compared withsingle-pass instruments have been reported for mea-surements that make use of Fabry–Perotresonators,
1–3including polarization-sensitive imple-mentations of cavity-ringdown spectroscopy,4,5aswell as laser cavities.6,7Both single-pass and multi-pass techniques typically determine the rotation inEq. (1) via intensity measurements that either re-quire rotating polarization optics or separate the or-thogonally polarized components of the light andtherefore require a balanced detection scheme.In this Letter we show that circular birefringence(optical rotation) can also be determined by fre-quency measurements. Left and right circularly po-larized modes acquire unequal phases when a chiralliquid is introduced into a resonator such that theirresonance frequencies shift relative to each other. Wedemonstrate the method, using a fiber optic ringresonator in combination with a narrow-linewidth cwlaser.A fiber-loop resonator
8,9may be considered to be afiber- or waveguide-based Fabry–Perot resonatorthat consists of a closed fiber loop in contact with alinear waveguide via a variable (directional) coupler.A resonance in the ring requires that the optical pathlength be a multiple of the wavelength of the light.Resonances are observed as minima in a transmis-sion spectrum whenever an integral multiple of thewavelength in the ring equals the circumference ofthe fiber loop. A shift in the resonance wavelength oc-curs if either the path length or the refractive indexchanges. Refractive indices may be measured by tun-ing the frequency of a laser with a sufficiently narrowlinewidth.Introduction of a sample with refractive indexnsinto the ring resonator will cause a wavelength shiftof the resonances relative to the reference mediumwith refractive indexn
0, which may, for instance, beair:/H9004/H9261
/H9261=ns−n0

nefff,/H208492/H20850wherefis the fraction of the total ring circumferencethat contains the optically active sample.n
effis an ef-fective refractive index used to describe the entirefiber-loop resonator in the presence of the referencemedium and corresponds to the round-trip phase2
/H9266neffL//H9261acquired by a resonant mode at wave-length/H9261, where the circumference (fiber and free-space part) isL.The inherent birefringence of a bent optical fiberwill in general give rise to resonant modes with dif-ferent polarization states.
10These modes may beused to generate circularly polarized modes that aresensitive to chirality. A wavelength shift that is equalin magnitude and opposite in sign for the two circu-larly polarized modes is a direct function of the liq-uid’s circular birefringence and hence of its opticalactivity. Thus, particular interest are relativechanges in the resonance wavelengths of a pair of leftand right circularly polarized modes centered at/H9261:
/H20879/H9004/H9261/H20849−/H20850−/H9004/H9261/H20849+/H20850

/H9261/H20879=n/H20849−/H20850−n/H20849+/H20850

nefff,/H208493/H20850where any common mode noise is automaticallyeliminated. It can also be seen that the equation de-scribing optical activity in a ring resonator is inde-pendent of the actual dimension of the ring. For agiven finesse and a given fractionf, a reduction in thesize of the ring does not lead to a loss of sensitivity.February 15, 2006 / Vol. 31, No. 4 / OPTICS LETTERS4530146-9592/06/040453-3/$15.00 © 2006 Optical Society of America