Open einelson opened 1 year ago
@einelson Thank you for creating an example and sharing the issue!
Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document.
See https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard
You can decode the PDF using
mutool clean -daf test_doc.pdf test_doc_clean.pdf
Then you can see the text streams like this:
4 0 obj
<<
/Length 3473
>>
stream
/P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 709.54 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(This is a )9(te)-3(st)9( do)-4(cu)13(m)-4(en)12(t )-3(b)3(y)-3( )9(Et)-2(h)3(an)4( Nels)13(o)-5(n)3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 252.89 709.54 Tm
0 g
0 G
[( )] TJ
ET
Q
EMC /P <</MCID 1>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 687.1 Tm
0 g
0 G
[( )] TJ
ET
Q
EMC /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 664.54 Tm
0 g
0 G
[(Tuesday )8(was )-3(a)12( go)7(o)-5(d)3( t)-3(i)13(m)-4(e)9( t)7(o)-5( c)-2(all)4( )9(\()] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 220.01 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 236.81 664.54 Tm
0 g
0 G
-0.105 Tc[(\) )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 242.57 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 259.37 664.54 Tm
0 g
0 G
[(-)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 262.61 664.54 Tm
0 g
0 G
[(0)-3(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 285.05 664.54 Tm
0 g
0 G
-0.142 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 290.21 664.54 Tm
0 g
0 G
[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 380.71 664.54 Tm
0 g
0 G
[(m)-4(b)3(er)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 404.71 664.54 Tm
0 g
0 G
-0.0221 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 409.87 664.54 Tm
0 g
0 G
[(This i)13(s a )-3(ran)5(d)3(o)5(m)-4( add)5(re)10(ss f)9(o)-5(r )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 650.02 Tm
0 g
0 G
[(te)-3(sting)6( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 105.26 650.02 Tm
0 g
0 G
[(p)3(u)3(rp)4(o)-5(s)11(es)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 146.3 650.02 Tm
0 g
0 G
[(:)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 149.18 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 151.7 650.02 Tm
0 g
0 G
[(3)7(4)-3(1)7( M)-5(ap)4(l)13(e )-3(st )6(P)-4(a)12(y)-3(t)9(o)-5(n)3(v)-4(il)3(le)11( M)-5(ain)6(e)9( 4)5(5)-3(6)7(8)-3(1)-3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 325.61 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
EMC /P <</MCID 3>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 627.58 Tm
0 g
0 G
-0.0322 Tc[(Any)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 89.184 627.58 Tm
0 g
0 G
[(way)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 107.42 627.58 Tm
0 g
0 G
[(,)11( t)-3(h)3(er)10(e )-3(are)11( rand)6(o)5(m)-4( )9(whitespac)12(es )-4(h)3(er)10(e)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 272.33 627.58 Tm
0 g
0 G
[(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 274.97 627.58 Tm
0 g
0 G
[( )] TJ
ET
Q
EMC
endstream
endobj
Let's focus on an example where PyPDF2 added an extra whitespace: "phone" became "ph one".
In the PDF, the part "This is his phone mu" is represented as:
[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
Here I would guess that PyPDF2 inserted a white space because of the 1 0 0 1 346.63 664.54 Tm sequence: this resets the 'cursor' to an absolute position, both in X (horizontal) and Y (vertical). I would guess the vertical position was detected as unchanged (otherwise a line break would have been emitted), but we currently do not, and cannot without increasing calculation time, compute the horizontal position (this requires a major change identified in the roadmap). Because of that, inserting a white space is treated as the most common case.
Thank you for the quick replies and the examples!
I apologize since I am not very familiar with PDF encodings. So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction? I might be looking at it very simply but it looks like you can parse the text from the 'tuples' in the list style objects in the stream to extract the raw-unformatted text. Is there a method that I can use to access the PDF encoding stream to attempt to do this?
-Sorry, this bit is off topic- My end goal is to do some text replacement in a PDF. Assuming I could figure out how to parse the text encodings myself and extract the text, these were some solutions I was looking at duplicating. However I am not 100% sure if editing the stream is considered best practice either and would guess that the encodings matter here for formatting purposes.
Thank you! Ethan N
So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction?
PyPDF2 tries to give a useful text extraction. I have shown you the pure "text" data from above. If you want that without any interpretation, you can get it like this:
from PyPDF2 import PdfReader
from PyPDF2.generic import ContentStream

reader = PdfReader("example.pdf")
# Parse the first page's content stream into (operands, operator) pairs.
stream = ContentStream(reader.pages[0]["/Contents"].get_object(), reader)
print(stream.operations)
Give it a shot and let us know how it works :-)
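As a rough illustration of what "parsing the tuples" could look like, the sketch below concatenates the string operands of Tj and TJ operators. It assumes the operations are (operands, operator) pairs with the operator given as bytes, and it runs on hand-written sample data rather than on a real PDF. The function name and the space_threshold value are hypothetical. It ignores font encodings and every positioning operator such as Tm, so it only suits very simple files like the one in this issue.

```python
def join_text_operands(operations, space_threshold=-200):
    """Concatenate the text shown by Tj/TJ operators, ignoring positioning."""
    parts = []
    for operands, operator in operations:
        if operator == b"Tj":
            items = [operands[0]]
        elif operator == b"TJ":
            items = operands[0]
        else:
            continue  # skip Tm, Tf, and every other operator
        for item in items:
            if isinstance(item, bytes):
                parts.append(item.decode("latin-1"))
            elif isinstance(item, str):
                parts.append(item)
            elif item <= space_threshold:
                # A large negative adjustment moves the pen far to the right,
                # which is treated here as a word gap (assumed threshold).
                parts.append(" ")
    return "".join(parts)


# Hand-written stand-in for the "phone" example from the stream above.
sample = [
    ([["This i", 13, "s his p", 3, "h"]], b"TJ"),
    ([1, 0, 0, 1, 346.63, 664.54], b"Tm"),
    ([["o", -5, "n", 3, "e ", 7, "m", -4, "u"]], b"TJ"),
]
print(join_text_operands(sample))  # This is his phone mu
```

Because Tm is skipped entirely, this approach never inserts a space that is encoded only as a cursor move, which is the opposite failure mode to the one reported here.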
My end goal is to do some text replacement in a PDF.
This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.
Don't forget that mutool clean heavily simplified the PDF, and your PDF is already pretty simple. In contrast, PyPDF2 needs to support all kinds of PDFs from the wild.
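The "pointers" are mainly the byte offsets stored in the cross-reference table, plus the /Length entries of streams. The toy sketch below is not a valid PDF and not PyPDF2 code. It only shows that making one object longer shifts the offset of every object after it, which is why a naive in-place text replacement invalidates the table. The header constant and the function name are made up for the example.

```python
HEADER = b"%PDF-1.7\n"


def object_offsets(objects):
    """Return the byte offset at which each serialized object would start."""
    offsets = []
    position = len(HEADER)
    for body in objects:
        offsets.append(position)
        position += len(body)
    return offsets


original = [
    b"1 0 obj\n(Hello)\nendobj\n",
    b"2 0 obj\n(World)\nendobj\n",
]
# Replace the text in object 1 with something six bytes longer.
edited = [original[0].replace(b"Hello", b"Hello there"), original[1]]

print(object_offsets(original))
print(object_offsets(edited))
```

Object 2 starts six bytes later after the edit, so a cross-reference table written for the original file would now point into the middle of object 1. A library that rewrites the whole file recomputes these offsets when saving.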
Give it a shot and let us know how it works :-)
Thank you! I can see the encoding stream here and can definitely see how confusing it is to make sense of it! I'll give parsing it a shot and see if I can pull out the text without the whitespaces.
This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.
That is good to know. Are there any resources on word replacement within a PDF that I could look into, or any other helpful documents?
@einelson An introduction to the PDF format is available here: http://preserve.mactech.com/articles/mactech/Vol.15/15.09/PDFIntro/index.html
The PDF standard is available here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf
I'm having the same issue with random whitespace additions and it's making regex matching nearly impossible. I'd like to +1 a fix for this even if the computation time increases. Thanks for all your work!
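Until extraction improves, one possible workaround for the regex problem is to match with a pattern that tolerates whitespace between any two characters. This is a sketch of that idea only; the function name is hypothetical and nothing like it is part of PyPDF2.

```python
import re


def whitespace_tolerant(phrase):
    """Compile a pattern matching phrase even if spaces were added or dropped."""
    chars = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


extracted = "This is his ph one mumber."
match = whitespace_tolerant("phone").search(extracted)
print(match.group())  # ph one
```

The trade-off is false positives: the same pattern also matches across genuine word boundaries, for example inside "graph one", so it is best suited to distinctive search terms.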
@brockenspectre This is not an issue of putting more computational power into the problem.
The issue is figuring out what is correct. And not only for a single PDF, but for all PDFs one could find in the wild.
I would like to add something wild I encountered to this issue. Unfortunately, I know too little about PDFs to make sense of it myself, but hopefully you lovely people can :)
Crypto n' Stocks - LinkedIn Teaser_reduced.pdf
This is a report teaser that was created by our designer in Figma.
This is how the text comes out after using pypdf:
Now, I was pretty quick in blaming Figma for probably creating a shitty file, but opening the same file in Acrobat Pro and copying any random section leads to perfectly usable text:
I'm curious to hear what the reason might be! Other PDFs are working fine as well.
Thanks for the work you are doing for all of us and have a good one!
I'm not experienced with PDFs, but this looks hard to solve.
Unfortunately, this problem has me stuck. I noticed that some libraries, like pdfminer.six and pdfplumber, don't have this problem. We could check how they deal with it.
From #1830: when the method page.extract_text() is used, the extracted text has no white spaces.
Actual output from the sample.pdf:
Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.
Whitespace is missing in two places: between "Text Formatting" and "Inline formatting", and between "Inline formatting" and "Here".
Expecting:
Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.
Or Minimum:
Expecting:
Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.
Environment
$ python -m platform
macOS-13.2.1-x86_64-i386-64bit
$ python -c "import pypdf;print(pypdf.__version__)"
3.0.1
Code + PDF
from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")
# Extract the text of every page in turn.
for page in reader.pages:
    text = page.extract_text()
Sample
Would it be possible to have a configurable argument to tune the sensitivity to whitespace? I've tried setting .extract_text(space_width=...) to different values but have not gotten different results in some of the examples shared above.
page.extract_text(space_width=2000)
My end goal is to do some text replacement in a PDF.
I am trying to achieve something similar. I came up with this. It works for some cases, but working with PDF is much more convoluted than I envisioned. @einelson, have you found a more robust solution?
I found that pymupdf did not have the random white space problem.
I am trying to extract text from various PDF documents to use in an NLP project. When using page.extractText(), random whitespace appears inside the extracted words where there are no spaces in the PDF document.
Environment
Using VS code and running via command prompt.
Code + PDF
This is a minimal, complete example that shows the issue:
test_doc.pdf (the PDF was generated using default settings in Microsoft Word). It looks like this:
The code is:
Output