Hugloss closed this issue 6 years ago
This is not specific to PyMuPDF: If you use the mudraw cli utility of MuPDF to print the text, you get this:
200
200
Customer information
Customer information
Customer
Customer
information
information
Customer information
Customer information ................ 200
200
Declaration of conformity ......... 200
Vehicle data recording and pri‐
Vehicle data recording and pri‐
vacy
vacy ........................................... 202
202
Event data recorders ............... 202
Radio Frequency Identification
(RFID) ..................................... 202
Customer information
Customer information
Declaration of conformity
Declaration of conformity
Transmission systems
Transmission systems
This vehicle has systems that
transmit and/or receive radio waves
subject to Directive 1999/5/EC.
These systems are in compliance
with the essential requirements and
other relevant provisions of
Directive 1999/5/EC. Copies of the
original Declarations of Conformity
can be obtained on our website.
Radar systems
Radar systems
Country-specific Declarations of
Conformity for radar systems are
shown on the following page:
which is exactly the same - except that PyMuPDF does not insert extra line breaks for lines within the same block. So this is a peculiarity of the PDF itself: it really does contain the same words multiple times. To confirm, I used one of my utilities that analyzes a PDF page layout, drawing a rectangle around each block and then inserting the block's text into it. I am getting this:
So, what can you do to detect if words occur multiple times? Here is a little script that does this:
import fitz
from operator import itemgetter

doc = fitz.open(...)
page = doc[201]
words = page.getTextWords()            # get all words info on page
words.sort(key=itemgetter(1, 0))       # sort by vertical, horizontal position
h = words[0][1]                        # vertical position of top-left word
oldrect = fitz.Rect()                  # control if word is in same rectangle
lines = []                             # all the re-created lines go here
line = ""                              # a fresh re-created line
for w in words:
    newrect = fitz.Rect(w[:4])         # rectangle of word
    if newrect == oldrect:             # same as old one?
        continue                       # skip
    oldrect = newrect                  # store our rect
    y = w[1]                           # vertical coord of word
    word = w[4]                        # the word text
    if y == h:                         # on same line?
        line += word                   # append word
        line += " "                    # append a space
    else:                              # start new line
        lines.append(line)
        line = word + " "
        h = y
lines.append(line)                     # append unfinished line
for line in lines:                     # print the re-created lines
    print(line)
Its output:
200 Customer information
Customer
Customer information
information
Declaration of conformity
Transmission systems
Customer information ................ 200
This vehicle has systems that
Declaration of conformity ......... 200
transmit and/or receive radio waves
subject to Directive 1999/5/EC.
Vehicle data recording and pri‐
These systems are in compliance
vacy ........................................... 202
with the essential requirements and
Event data recorders ............... 202
other relevant provisions of
Radio Frequency Identification
Directive 1999/5/EC. Copies of the
(RFID) ..................................... 202
original Declarations of Conformity
can be obtained on our website.
Radar systems
Country-specific Declarations of
Conformity for radar systems are
shown on the following page:
Can obviously still be improved ... maybe by combining it with the layout analyzer mentioned above, or whatever.
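One possible refinement, as a rough sketch only: the script above skips a word only when its rectangle equals that of the immediately preceding word, and it compares vertical positions with exact float equality. The helper names dedupe_words and group_lines, the rounding precision ndigits and the tolerance tol are all hypothetical. The input is assumed to be tuples shaped like the items of page.getTextWords(), i.e. (x0, y0, x1, y1, word, ...), so the sketch runs without PyMuPDF.

```python
def dedupe_words(words, ndigits=1):
    """Drop words that repeat the same text at the same (rounded) rectangle."""
    seen = set()
    unique = []
    for w in words:
        key = (tuple(round(c, ndigits) for c in w[:4]), w[4])
        if key in seen:            # same word at the same place already kept
            continue
        seen.add(key)
        unique.append(w)
    return unique


def group_lines(words, tol=2.0):
    """Group words into lines by their top coordinate, then order by x."""
    lines = []                     # each entry: list of word tuples on one line
    line_y = None
    for w in sorted(words, key=lambda w: w[1]):
        if line_y is None or abs(w[1] - line_y) > tol:
            lines.append([])       # vertical jump: start a new line
            line_y = w[1]
        lines[-1].append(w)
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
            for line in lines]
```

With real data you would pass page.getTextWords() to dedupe_words and the result to group_lines. Suitable values for ndigits and tol depend on the document and would need some experimenting.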
Please tell me your platform and PyMuPDF version number for the records.
If you modify the last script a little, like so:
import fitz
from operator import itemgetter

doc = fitz.open(...)
page = doc[201]
words = page.getTextWords()
words.sort(key=itemgetter(1, 0))
y = words[0][1]
x = words[0][0]
oldrect = None
lines = []
line = ""
for w in words:
    newrect = fitz.Rect(w[:4])
    if newrect == oldrect:
        continue
    oldrect = newrect
    word = w[4]
    if newrect.y0 == y:
        line += word
        line += " "
    else:
        lines.append([x, y, line])
        line = word + " "
        y = newrect.y0
        x = newrect.x0
lines.append([x, y, line])
# we acknowledge that the page effectively has 2 columns:
for x0, y0, line in lines:
    space = ""
    if x0 > 180:
        space = " " * 60
    print(space + line)
you get this:
200 Customer information
Customer
Customer information
information
Declaration of conformity
Transmission systems
Customer information ................ 200
This vehicle has systems that
Declaration of conformity ......... 200
transmit and/or receive radio waves
subject to Directive 1999/5/EC.
Vehicle data recording and pri‐
These systems are in compliance
vacy ........................................... 202
with the essential requirements and
Event data recorders ............... 202
other relevant provisions of
Radio Frequency Identification
Directive 1999/5/EC. Copies of the
(RFID) ..................................... 202
original Declarations of Conformity
can be obtained on our website.
Radar systems
Country-specific Declarations of
Conformity for radar systems are
shown on the following page:
which is closer to the original ...
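Instead of padding the right-hand lines, you could also emit the columns one after the other. The following is only a sketch under assumptions: order_by_columns is a hypothetical helper, the column boundaries (such as the x = 180 used above) are assumed to be known in advance, and lines is the list of [x, y, text] items built by the script above.

```python
from bisect import bisect_left


def order_by_columns(lines, boundaries):
    """Return line texts column by column, top to bottom within each column.

    lines: [x, y, text] items as built by the script above.
    boundaries: ascending x positions separating the columns (assumed known).
    """
    columns = [[] for _ in range(len(boundaries) + 1)]
    for x, y, text in lines:
        # x > boundary puts the line into the column to the right of it
        columns[bisect_left(boundaries, x)].append((y, text))
    ordered = []
    for col in columns:
        ordered.extend(text for y, text in sorted(col, key=lambda item: item[0]))
    return ordered
```

For the page above this would be called as order_by_columns(lines, [180]). Detecting the boundaries automatically is a separate problem that this sketch does not address.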
Allow me a more "philosophical" comment. Pure text output of pages with a complex layout always suffers from similar difficulties. This case was a little more complex, because the PDF contains the exact same words in the exact same locations multiple times, which is invisible in PDF readers. Apart from that, some intelligence always has to be implemented in the print script: to understand what is where, to detect a column-based layout or tables, or, even more difficult, to handle text that is not oriented left-to-right / top-to-bottom, etc.
Thank you for your explanation, it was clear, and for the code :)
This PDF is really annoying... Is that why, on page 122, I can't find the title in the text?
doc = fitz.open(...)
page = doc[122]
These are the titles:
['Conditions for an Autostop', 'Restart of the engine by the driver', 'Restart of the engine by the stop-start system', 'Fault']
And this is the text:
Driving and operating Driving and operating 121 121 Conditions for an Autostop Conditions for an Autostop The stop-start system checks if ea ch of the following conditions is fulfilled: ● the stop-start system is not manually deactivated ● the bonnet is fully closed ● the vehicle battery is sufficiently charged and in good condition ● the engine is warmed-up ● the engine coolant temperature is not too high ● the outside temperature is not too low or too high (e.g. below 0 °C or above 35 °C) ● the brake vacuum is sufficient ● the defrosting function is not activated 3 105 ● the self-cleaning function of the diesel particle filter is not active 3 122 ● the Antilock brake system (ABS) 3 128, Traction Control system (TC) 3 130 and Electron ic Stability Program (ESP®Plus) 3 131 ride control systems are not actively engaged ● the vehicle has moved since the last Autostop Otherwise an Autostop will be inhibited. Certain settings of the climate control system may inhibit an Autostop. See "Climate con trol Climate control" chapter for further information 3 105 Restart of the engine by the driver Restart of the engine by the driver Depress the clutch pedal to restart the engine. Note Note If any gear is selected, the clutch pedal must be fully depressed to restart t he engine. Control indicator Ï 3 90 extinguishes in the instrument cluster when the engine is resta rted. Restart of the engine by the stop- Restart of the engine by the stop- start system start system The selector lever must be in neutral to enable an automatic restart. If one of the following conditions occurs during an Autostop, the engine will be restart ed automatically by the stop-start system: ● the stop-start system is manually deactivated ● the bonnet is opened ● the vehicle battery is discharged ● the engine temperature is too low ● the brake vacuum is not sufficient ● the vehicle starts to move ● the defrosting function is activated 3 105 If an electrical accessory, e.g. 
a portable CD player, is connected to the power outlet, a brief power drop during engine restart may be noticeable.
where the title 'Restart of the engine by the stop-start system' cannot be found, because it does not match the text 'Restart of the engine by the stop- start system'?
Hm, everything looks fine to me (apart from what we discussed previously).
Original page in a PDF reader:
Note that the physical page is 123 (1-based, 122 0-based), whereas the logical page is 121 (document's own numbering scheme).
This is the output from PyMuPDF:
You will find "Restart of the engine by the driver", corresponding to the bookmark (table of contents entry), in the middle column and at position 1065 of PyMuPDF's text output. The text "Restart of the engine by the stop- start" is at position 1378.
Exactly: "Restart of the engine by the stop- start" is found in the text, but not "Restart of the engine by the stop-start system", which is the bookmark from getToc().
Maybe this is the same problem as before, but I thought that getText() printed everything as it is, so I should find all the bookmarks in the text. This bookmark I can't find because of the difference:
"stop-start" in bookmark
"stop- start" in text
No error there! Bookmark text is completely independent of other text; it could be anything. It is not generated from page text or the like, but entered by some other means. You could have headers in the text without corresponding bookmarks, and bookmark text without a counterpart in the text. Bookmarks could also point to other documents or to resources on the internet, etc.
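If you still want to look for bookmark titles in the page text, a tolerant comparison may help in cases like "stop- start" versus "stop-start". The sketch below is an assumption-laden illustration, not a PyMuPDF feature: normalize and find_title are hypothetical helpers that collapse whitespace and remove the space left behind a line-break hyphen before comparing.

```python
import re


def normalize(text):
    """Collapse whitespace and rejoin words hyphenated across a line break."""
    text = re.sub(r"\s+", " ", text)      # any whitespace run -> one space
    # "stop- start" -> "stop-start" (ASCII hyphen or U+2010 followed by a space)
    text = re.sub(r"(?<=\w)([-\u2010]) (?=\w)", r"\1", text)
    return text.strip().lower()


def find_title(title, page_text):
    """Return the index of title in the normalized page text, or -1."""
    return normalize(page_text).find(normalize(title))
```

Note that the returned index refers to the normalized text, not to the original getText() output, and that a title with no counterpart on the page still yields -1, as explained above.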
Hi, Thank you for all your answers :) but I have one more to ask.
Have you thought about making it possible to get the text of all paragraphs (with title) for all pages?
Just as you do
page_text = doc[i].getText()
you could do:
page_outlines_text = doc[i].getParagraphText()
first_outline_text_for_page = page_outlines_text[0]
Is this a feature you thought of or think you will want to add?
I am not 100% sure I understand. Let me try another synopsis:
getText() (with its 6 parameter variants, plus getTextWords() and getTextBlocks()) extracts the text from the page in a sequence which in general equals neither the reading sequence nor any other specific sequence. Only the geometry information (the bbox) tells you where a piece of text is physically located on the page. It may well be the page footer text that appears first in the output.
Page chapter. I didn't bother to provide separate methods for doing this, simply because it didn't seem to have enough priority. Memory consumption also needs to be considered: extracting all the text of a large document like Adobe's manual (1'310 pages!) at once may be a problem. Of course, you could create an iterator like in the following snippet:
for text in getHTML(doc):
    # each 'text' contains the html text of one page
    ...
But it never occurred to me that this is much better than just using
for page in doc:
    text = page.getText("html")
    ...
I hope I understood what you meant ...
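For illustration, an iterator of the kind hinted at by getHTML(doc) could be a plain generator. This is only a sketch: iter_page_text is a hypothetical name, not a PyMuPDF function, and it assumes the page.getText(kind) call used in this thread (newer PyMuPDF versions spell it get_text).

```python
def iter_page_text(doc, kind="html"):
    """Yield the text of one page at a time, so that only one page's
    text needs to be held in memory while iterating."""
    for page in doc:
        yield page.getText(kind)
```

It would be used like "for text in iter_page_text(doc): ...", which, as said above, is hardly better than writing the loop over the pages directly.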
using this pdf: https://www.opel.ie/content/dam/opel/ireland/owners/manuals/pdf/vivaro/om_vivaro_kta-2769_2-en_eu_my16_ed0415_5_en_gb.pdf
for example the result for page 200:
200 200 Customer information Customer information Customer Customer information information Customer information Customer information ................ 200 200 Declaration of conformity ......... 200 Vehicle data recording and pri‐ Vehicle data recording and pri‐ vacy vacy ........................................... 202 202 Event data recorders ............... 202 Radio Frequency Identification (RFID) ..................................... 202 Customer information Customer information Declaration of conformity Declaration of conformity Transmission systems Transmission systems This vehicle has systems that transmit and/or receive radio waves subject to Directive 1999/5/EC. These systems are in compliance with the essential requirements and other relevant provisions of Directive 1999/5/EC. Copies of the original Declarations of Conformity can be obtained on our website. Radar systems Radar systems Country-specific Declarations of Conformity for radar systems are shown on the following page:
has faulty text at rows 7, 8 and 9.
The correct text should be