py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Random whitespaces are inserted when using page.extract_text() #1507

Open einelson opened 1 year ago

einelson commented 1 year ago

I am trying to extract text from various PDF documents to use in an NLP project. When using page.extract_text(), random whitespace appears inside the extracted words in places where the PDF document has no spaces.

Environment

Using VS Code and running via the command prompt.

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1

Code + PDF

This is a minimal, complete example that shows the issue:

test_doc.pdf (PDF was generated using default settings in Microsoft word). It looks like this:

image

The code is:

import os

from PyPDF2 import PdfReader, __version__

pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))

print(f"PyPDF2=={__version__}")

text = ""
for page in pdf.pages:
    page_content = page.extract_text()
    text = text + page_content
print(text)

Output

PyPDF2==2.12.1
This is a test document by Ethan Nelson.  

Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for 
testing purposes : 341 Maple st Paytonville Maine 45681.  
Anyway, there are random whitespaces here . 

MartinThoma commented 1 year ago

@einelson Thank you for creating an example and sharing the issue!

Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document.

See https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard

MartinThoma commented 1 year ago

You can decode the PDF using

mutool clean -daf test_doc.pdf test_doc_clean.pdf

Then you can see the text streams like this:

4 0 obj
<<
  /Length 3473
>>
stream
 /P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 709.54 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(This is a )9(te)-3(st)9( do)-4(cu)13(m)-4(en)12(t )-3(b)3(y)-3( )9(Et)-2(h)3(an)4( Nels)13(o)-5(n)3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 252.89 709.54 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 1>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 687.1 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 664.54 Tm
0 g
0 G
[(Tuesday )8(was )-3(a)12( go)7(o)-5(d)3( t)-3(i)13(m)-4(e)9( t)7(o)-5( c)-2(all)4( )9(\()] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 220.01 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 236.81 664.54 Tm
0 g
0 G
 -0.105 Tc[(\) )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 242.57 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 259.37 664.54 Tm
0 g
0 G
[(-)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 262.61 664.54 Tm
0 g
0 G
[(0)-3(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 285.05 664.54 Tm
0 g
0 G
 -0.142 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 290.21 664.54 Tm
0 g
0 G
[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 380.71 664.54 Tm
0 g
0 G
[(m)-4(b)3(er)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 404.71 664.54 Tm
0 g
0 G
 -0.0221 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 409.87 664.54 Tm
0 g
0 G
[(This i)13(s a )-3(ran)5(d)3(o)5(m)-4( add)5(re)10(ss f)9(o)-5(r )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 650.02 Tm
0 g
0 G
[(te)-3(sting)6( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 105.26 650.02 Tm
0 g
0 G
[(p)3(u)3(rp)4(o)-5(s)11(es)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 146.3 650.02 Tm
0 g
0 G
[(:)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 149.18 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 151.7 650.02 Tm
0 g
0 G
[(3)7(4)-3(1)7( M)-5(ap)4(l)13(e )-3(st )6(P)-4(a)12(y)-3(t)9(o)-5(n)3(v)-4(il)3(le)11( M)-5(ain)6(e)9( 4)5(5)-3(6)7(8)-3(1)-3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 325.61 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 3>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 627.58 Tm
0 g
0 G
 -0.0322 Tc[(Any)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 89.184 627.58 Tm
0 g
0 G
[(way)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 107.42 627.58 Tm
0 g
0 G
[(,)11( t)-3(h)3(er)10(e )-3(are)11( rand)6(o)5(m)-4( )9(whitespac)12(es )-4(h)3(er)10(e)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 272.33 627.58 Tm
0 g
0 G
[(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 274.97 627.58 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC 
endstream
endobj

MartinThoma commented 1 year ago

Let's focus on an example where PyPDF2 added an extra whitespace: phone became ph one.

In the PDF, the part "This is his phone mu" is represented as:

[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
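
To make the split concrete, here is a minimal sketch (not PyPDF2's implementation) that joins the string pieces of the two TJ arrays above. In a TJ array, the numbers between the strings are position adjustments in thousandths of a text space unit that are subtracted from the current position, so only a large negative number opens a visible gap. The function name tj_to_text and the threshold of 200 are assumptions made for illustration, and the parser ignores nested parentheses and octal escapes.

```python
import re

# One TJ element: either a literal string "( ... )" with backslash escapes,
# or a number. Simplification: nested unescaped parentheses, octal escapes
# and numbers written like ".5" are not handled.
_PIECE = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>-?\d+(?:\.\d+)?)")


def tj_to_text(tj_array: str, space_threshold: float = 200.0) -> str:
    """Join the string pieces of a TJ operand such as "[(ab)-3(c)]".

    space_threshold is a hypothetical cut-off: an adjustment that moves the
    cursor forward by more than this many thousandths becomes a space.
    """
    out = []
    for match in _PIECE.finditer(tj_array):
        if match.group("text") is not None:
            # Drop the backslash of simple escapes such as "\(" and "\)".
            out.append(re.sub(r"\\(.)", r"\1", match.group("text")))
        elif -float(match.group("num")) > space_threshold:
            out.append(" ")
    return "".join(out)


first = tj_to_text("[(This i)13(s his p)3(h)]")
second = tj_to_text("[(o)-5(n)3(e )7(m)-4(u)]")
print(first + second)  # This is his phone mu
```

Within each array the adjustments are tiny (between -5 and 13), so no space is added there. The stray space in "ph one" therefore has to come from the boundary between the two text objects, not from the arrays themselves.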

pubpub-zz commented 1 year ago

Let's focus on an example where PyPDF2 added an extra whitespace: phone became ph one.

In the PDF, the part "This is his phone mu" is represented as:

[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ

Here I would guess that PyPDF2 inserted a whitespace because of the 1 0 0 1 346.63 664.54 Tm sequence: it resets the 'cursor' to an absolute position, both in X (horizontal) and Y (vertical). I would guess the vertical position was detected as unchanged (otherwise a line break would have been inserted), but we currently do not, and cannot without increasing calculation time, compute the horizontal position (this requires a major change identified in the roadmap). Because of that, inserting a whitespace is treated as the most common case.
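
As a rough illustration of the missing horizontal check, the sketch below compares where the previous run ended with where the next Tm places the cursor. It is not how PyPDF2 works today: the function needs_space, the 0.2 ratio, and above all the end position of the previous run are assumptions. Computing that end position from the font's glyph widths is the expensive part mentioned above.

```python
def needs_space(prev_end_x: float, next_start_x: float, font_size: float,
                min_gap_ratio: float = 0.2) -> bool:
    """Return True if the horizontal gap between two runs looks like a word gap.

    min_gap_ratio is a hypothetical tuning value: the gap must exceed this
    fraction of the font size before a space is inserted.
    """
    gap = next_start_x - prev_end_x
    return gap > min_gap_ratio * font_size


FONT_SIZE = 11.04      # from "/F1 11.04 Tf"
NEXT_START_X = 346.63  # from "1 0 0 1 346.63 664.54 Tm"

# Assumption for illustration only: the glyph widths of "This is his ph"
# put the end of that run at x = 346.6.
assumed_prev_end_x = 346.6

print(needs_space(assumed_prev_end_x, NEXT_START_X, FONT_SIZE))
```

With an end position that close to the new cursor position, the gap is far below the threshold and no space would be inserted between "ph" and "one".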

einelson commented 1 year ago

Thank you for the quick replies and the examples!

I apologize since I am not very familiar with PDF encodings. So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction? I might be looking at it very simply but it looks like you can parse the text from the 'tuples' in the list style objects in the stream to extract the raw-unformatted text. Is there a method that I can use to access the PDF encoding stream to attempt to do this?

-Sorry, this bit is off topic- My end goal is to do some text replacement in a PDF. Assuming I could figure out how to parse the text encodings myself and extract the text, these were some solutions I was looking at duplicating. However I am not 100% sure if editing the stream is considered best practice either and would guess that the encodings matter here for formatting purposes.

Thank you! Ethan N

MartinThoma commented 1 year ago

So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction?

PyPDF2 tries to give a useful text extraction. I have shown you the pure "text" data above. If you want that without any interpretation, you can get it like this:

from PyPDF2 import PdfReader
from PyPDF2.generic import ContentStream

reader = PdfReader("example.pdf")
stream = ContentStream(reader.pages[0]["/Contents"].get_object(), reader)
print(stream.operations)

Give it a shot and let us know how it works :-)
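
A possible next step, sketched under assumptions: pull only the string operands out of the Tj and TJ operators and ignore all positioning. This assumes each entry of stream.operations is an (operands, operator) pair with the operator as bytes, and that string operands behave like str or bytes. The helper name text_from_operations is made up here, and the sample data is hand-written from the stream shown earlier so the block runs without a PDF. In real files the strings may be font-specific glyph codes rather than readable text.

```python
from typing import Any, Iterable, List, Tuple


def text_from_operations(operations: Iterable[Tuple[List[Any], bytes]]) -> str:
    """Concatenate the string operands of Tj and TJ, ignoring positioning."""
    parts = []
    for operands, operator in operations:
        if operator == b"Tj":
            candidates = operands[:1]          # single string operand
        elif operator == b"TJ":
            candidates = operands[0] if operands else []  # array of strings and numbers
        else:
            continue                           # Tm, Tf, BT, ET, ... carry no text
        for item in candidates:
            if isinstance(item, bytes):
                parts.append(item.decode("latin-1"))
            elif isinstance(item, str):
                parts.append(item)
            # numbers are kerning adjustments and are skipped
    return "".join(parts)


# Hand-written stand-in for stream.operations, based on the stream above.
sample = [
    ([["This i", 13, "s his p", 3, "h"]], b"TJ"),
    ([], b"ET"),
    ([1, 0, 0, 1, 346.63, 664.54], b"Tm"),
    ([["o", -5, "n", 3, "e ", 7, "m", -4, "u"]], b"TJ"),
]
print(text_from_operations(sample))  # This is his phone mu
```

With the snippet above, the call would be text_from_operations(stream.operations). Because positioning is ignored entirely, this never inserts spaces or line breaks that are not stored as characters, so separate lines run together.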

My end goal is to do some text replacement in a PDF.

This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.

MartinThoma commented 1 year ago

Don't forget that mutool clean heavily simplified the PDF, and your PDF is already pretty simple. In contrast, PyPDF2 needs to support all kinds of PDFs from the wild.

einelson commented 1 year ago

Give it a shot and let us know how it works :-)

Thank you! I can see the encoding stream here and can definitely see how confusing it is to make sense of it! I'll give parsing it a shot and see if I can pull out the text without the whitespaces.

This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.

That is good to know. Are there any resources or helpful documents on word replacement within a PDF that I could look into?

pubpub-zz commented 1 year ago

@einelson An introduction to the PDF format is available here: http://preserve.mactech.com/articles/mactech/Vol.15/15.09/PDFIntro/index.html

The PDF standard is available here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf

brockenspectre commented 1 year ago

I'm having the same issue with random whitespace additions and it's making regex matching nearly impossible. I'd like to +1 a fix for this even if the computation time increases. Thanks for all your work!
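
Until extraction improves, one workaround for the regex problem is to match with a pattern that tolerates whitespace between the characters of the search term. This is a sketch of that idea, not a pypdf feature; loose_pattern is a hypothetical helper, and the sample string is taken from the output in the original report.

```python
import re


def loose_pattern(word: str) -> "re.Pattern[str]":
    """Compile a pattern matching `word` with optional whitespace between its characters."""
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))


extracted = "This is his ph one mu mber ."
match = loose_pattern("phone").search(extracted)
print(match.group(0))  # ph one
```

The trade-off is false positives: the pattern also matches across genuine word boundaries (searching for "theme" would match "the me"), so it suits distinctive terms better than short common words.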

MartinThoma commented 1 year ago

@brockenspectre This is not an issue of putting more computational power into the problem.

The issue is figuring out what is correct. And not only for a single PDF, but for all PDFs one could find in the wild.

JaWas2019 commented 1 year ago

I would like to add something wild I encountered to this issue. Unfortunately, I know too little about PDFs to make sense of it myself, but hopefully you lovely people can :)

Crypto n' Stocks - LinkedIn Teaser_reduced.pdf

This is a report teaser that was created by our designer in Figma.

This is how the text comes out after using pypdf:

image

Now, I was pretty quick in blaming Figma for probably creating a shitty file, but opening the same file in Acrobat Pro and copying any random section leads to perfectly usable text:

image

I'm curious to hear what the reason might be! Other PDFs are working fine.

Thanks for the work you are doing for all of us and have a good one!

renanzulian commented 1 year ago

I'm not experienced with PDFs, but this looks hard to solve.

Unfortunately, this problem has me stuck. I noticed that some libraries, like pdfminer.six and pdfplumber, don't have this problem. We could check how they deal with it.

pubpub-zz commented 1 year ago

From #1830: If the method page.extract_text() is used, the extracted text has no whitespace.

Actual output from the sample.pdf:

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.

There are missing whitespaces.

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.
               ^                ^

Expecting:

Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Or Minimum:

Expecting:

Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Environment

$ python -m platform
macOS-13.2.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.0.1

Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")

for page_num in range(len(reader.pages)):
    page = reader.pages[page_num]
    text = page.extract_text()

Sample

sample.pdf

thatperson42 commented 1 year ago

Would it be possible to have a configurable argument to tune the sensitivity to whitespace? I've tried setting .extract_text(space_width=...) to different values, for example page.extract_text(space_width=2000), but have not gotten different results in some of the examples shared above.

hoehermann commented 6 months ago

My end goal is to do some text replacement in a PDF.

I am trying to achieve something similar. I came up with this. It works for some cases, but working with PDF is much more convoluted than I envisioned. @einelson, have you found a more robust solution?

gregdingle commented 6 months ago

I found that pymupdf did not have the random white space problem.