pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

get duplicated outline text from .getText() #172

Closed Hugloss closed 6 years ago

Hugloss commented 6 years ago

using this pdf: https://www.opel.ie/content/dam/opel/ireland/owners/manuals/pdf/vivaro/om_vivaro_kta-2769_2-en_eu_my16_ed0415_5_en_gb.pdf

for example the result for page 200: 200 200 Customer information Customer information Customer Customer information information Customer information Customer information ................ 200 200 Declaration of conformity ......... 200 Vehicle data recording and pri‐ Vehicle data recording and pri‐ vacy vacy ........................................... 202 202 Event data recorders ............... 202 Radio Frequency Identification (RFID) ..................................... 202 Customer information Customer information Declaration of conformity Declaration of conformity Transmission systems Transmission systems This vehicle has systems that transmit and/or receive radio waves subject to Directive 1999/5/EC. These systems are in compliance with the essential requirements and other relevant provisions of Directive 1999/5/EC. Copies of the original Declarations of Conformity can be obtained on our website. Radar systems Radar systems Country-specific Declarations of Conformity for radar systems are shown on the following page:

The text is wrong in rows 7, 8 and 9:

Customer information Customer information
Declaration of conformity Declaration of conformity
Transmission systems Transmission systems

The correct text should be

Customer information
Declaration of conformity
Transmission systems

JorjMcKie commented 6 years ago

This is not specific to PyMuPDF: If you use the mudraw cli utility of MuPDF to print the text, you get this:

200
200
Customer information
Customer information

Customer
Customer
information
information

Customer information
Customer information ................ 200
200

Declaration of conformity ......... 200

Vehicle data recording and pri‐
Vehicle data recording and pri‐
vacy
vacy ........................................... 202
202

Event data recorders ............... 202
Radio Frequency Identification
(RFID) ..................................... 202

Customer information
Customer information

Declaration of conformity
Declaration of conformity

Transmission systems
Transmission systems

This vehicle has systems that
transmit and/or receive radio waves
subject to Directive 1999/5/EC.
These systems are in compliance
with the essential requirements and
other relevant provisions of
Directive 1999/5/EC. Copies of the
original Declarations of Conformity
can be obtained on our website.

Radar systems
Radar systems

Country-specific Declarations of
Conformity for radar systems are
shown on the following page:

which is exactly the same, except that PyMuPDF does not insert extra line breaks between lines within the same block. So this is a peculiarity of the PDF itself: it contains the same words multiple times. To confirm, I used one of my utilities that analyzes a PDF page layout by drawing a rectangle around each block and then inserting the block's text into it. I am getting this:

[image: page layout with a rectangle drawn around each text block]

So, what can you do to detect if words occur multiple times? Here is a little script that does this:

import fitz
from operator import itemgetter

doc = fitz.open(...)
page = doc[201]

words = page.getTextWords()   # get all words info on page
words.sort(key = itemgetter(1,0)) # sort by vertical, horizontal position

h = words[0][1] # vertical position of top-left word

oldrect = fitz.Rect() # control if word is in same rectangle
lines = [] # all the re-created lines go here
line = ""  # a fresh re-created line
for w in words:
    newrect = fitz.Rect(w[:4]) # rectangle of word
    if newrect == oldrect:     # same as old one?
        continue               # skip
    oldrect = newrect          # store our rect
    y = w[1]                   # vertical coord of word
    word = w[4]                # the word text
    if y == h:                 # on same line?
        line += word           # append word
        line += " "            # append a space
    else:                      # start new line
        lines.append(line)
        line = word + " "
        h = y
lines.append(line)             # append unfinished line
for line in lines:             # print the re-created lines
    print(line)

Its output:

200 Customer information
Customer
Customer information
information
Declaration of conformity
Transmission systems
Customer information ................ 200
This vehicle has systems that
Declaration of conformity ......... 200
transmit and/or receive radio waves
subject to Directive 1999/5/EC.
Vehicle data recording and pri‐
These systems are in compliance
vacy ........................................... 202
with the essential requirements and
Event data recorders ............... 202
other relevant provisions of
Radio Frequency Identification
Directive 1999/5/EC. Copies of the
(RFID) ..................................... 202
original Declarations of Conformity
can be obtained on our website.
Radar systems
Country-specific Declarations of
Conformity for radar systems are
shown on the following page:

This can obviously still be improved, maybe by combining it with the layout analyzer mentioned above.
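
One possible refinement, as a minimal sketch rather than anything PyMuPDF ships: the script above only skips a word when its rectangle equals that of the immediately preceding word, and it starts a new line whenever the vertical coordinate differs at all. The hypothetical helpers below (dedupe_words, group_lines and both tolerance values are assumptions) remember every rectangle already seen and allow a small vertical tolerance. They work on plain tuples of the form (x0, y0, x1, y1, word), assumed to match the first five items of each entry returned by getTextWords(). The sample coordinates are made up.

```python
def dedupe_words(words, ndigits=1):
    """Drop words whose text and (rounded) bounding box were seen before."""
    seen = set()
    unique = []
    for w in words:
        # key = rounded rectangle plus the word text
        key = (tuple(round(c, ndigits) for c in w[:4]), w[4])
        if key not in seen:
            seen.add(key)
            unique.append(w)
    return unique


def group_lines(words, y_tol=2.0):
    """Group words into lines; tops within y_tol of the line's first word match."""
    groups = []
    current = []
    for w in sorted(words, key=lambda w: w[1]):  # top to bottom
        if current and abs(w[1] - current[0][1]) > y_tol:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    # inside each line, order the words from left to right
    return [" ".join(w[4] for w in sorted(g, key=lambda w: w[0])) for g in groups]


# Made-up sample: every word is present twice at the same position.
sample_words = [
    (10.0, 50.0, 60.0, 62.0, "Customer"),
    (65.0, 50.4, 120.0, 62.4, "information"),
    (10.0, 50.0, 60.0, 62.0, "Customer"),
    (65.0, 50.4, 120.0, 62.4, "information"),
    (10.0, 80.0, 70.0, 92.0, "Declaration"),
    (10.0, 80.0, 70.0, 92.0, "Declaration"),
]
sample_lines = group_lines(dedupe_words(sample_words))
for line in sample_lines:
    print(line)
```

This does not handle the two-column layout of the page, so words from both columns that share a vertical position still end up on one line.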

Please tell me your platform and PyMuPDF version number for the record.

JorjMcKie commented 6 years ago

If you modify the last script a little, like so:

import fitz
from operator import itemgetter

doc = fitz.open(...)
page = doc[201]
words = page.getTextWords()
words.sort(key = itemgetter(1,0))

y = words[0][1]
x = words[0][0]

oldrect = None
lines = []
line = ""
for w in words:
    newrect = fitz.Rect(w[:4])
    if newrect == oldrect:
        continue
    oldrect = newrect
    word = w[4]
    if newrect.y0 == y:
        line += word
        line += " "
    else:
        lines.append([x, y, line])
        line = word + " "
        y = newrect.y0
        x = newrect.x0

lines.append([x, y, line])
# we acknowledge, that the page effectively has 2 columns:
for x0, y0, line in lines:
    space = ""
    if x0 > 180:
        space = " " * 60
    print(space + line)

you get this:

200 Customer information 
Customer 
                                                            Customer information 
information 
                                                            Declaration of conformity 
                                                            Transmission systems 
Customer information ................ 200 
                                                            This vehicle has systems that 
Declaration of conformity ......... 200 
                                                            transmit and/or receive radio waves 
                                                            subject to Directive 1999/5/EC. 
Vehicle data recording and pri‐ 
                                                            These systems are in compliance 
vacy ........................................... 202 
                                                            with the essential requirements and 
Event data recorders ............... 202 
                                                            other relevant provisions of 
Radio Frequency Identification 
                                                            Directive 1999/5/EC. Copies of the 
(RFID) ..................................... 202 
                                                            original Declarations of Conformity 
                                                            can be obtained on our website. 
                                                            Radar systems 
                                                            Country-specific Declarations of 
                                                            Conformity for radar systems are 
                                                            shown on the following page: 

which is closer to the original ...
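
If the goal is reading order rather than a visual imitation of the page, the same [x, y, line] entries could be emitted column by column instead. The following is only a sketch under the same assumption as the script above, namely that a fixed horizontal threshold (180 here, taken from that script) separates the two columns. The function name and the sample coordinates are made up for illustration.

```python
def columns_in_reading_order(lines, split_x=180):
    """Return line texts with the whole left column first, then the right one.

    'lines' holds [x, y, text] entries as built by the script above.
    """
    left = [entry for entry in lines if entry[0] <= split_x]
    right = [entry for entry in lines if entry[0] > split_x]
    ordered = sorted(left, key=lambda e: e[1]) + sorted(right, key=lambda e: e[1])
    return [text.strip() for _, _, text in ordered]


# Made-up coordinates; the texts are taken from the output above.
sample = [
    [30, 100, "Customer information ................ 200 "],
    [200, 105, "This vehicle has systems that "],
    [30, 110, "Declaration of conformity ......... 200 "],
    [200, 115, "transmit and/or receive radio waves "],
]
ordered_text = columns_in_reading_order(sample)
for text in ordered_text:
    print(text)
```

A fixed threshold only suits pages that share this layout; other pages would need their column boundaries detected first.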

JorjMcKie commented 6 years ago

Allow me a more "philosophical" comment. Pure text output of pages with a complex layout always suffers from similar difficulties. This case was a little more complex because the PDF contains the exact same words in the exact same locations multiple times, which is invisible in PDF readers. Apart from that, some intelligence always has to be implemented in the print script: to understand what is where, to detect a column-based layout or tables, or, even more difficult, to handle text that is not oriented left-to-right / top-to-bottom, etc.

Hugloss commented 6 years ago

Thank you for your explanation, it was clear, and for the code :)

Hugloss commented 6 years ago

This PDF is really annoying... Is that why I can't find the title in the text on page 122?

doc = fitz.open(...)
page = doc[122]

These titles: ['Conditions for an Autostop', 'Restart of the engine by the driver', 'Restart of the engine by the stop-start system', 'Fault']

And the text Driving and operating Driving and operating 121 121 Conditions for an Autostop Conditions for an Autostop The stop-start system checks if ea ch of the following conditions is fulfilled: ● the stop-start system is not manually deactivated ● the bonnet is fully closed ● the vehicle battery is sufficiently charged and in good condition ● the engine is warmed-up ● the engine coolant temperature is not too high ● the outside temperature is not too low or too high (e.g. below 0 °C or above 35 °C) ● the brake vacuum is sufficient ● the defrosting function is not activated 3 105 ● the self-cleaning function of the diesel particle filter is not active 3 122 ● the Antilock brake system (ABS) 3 128, Traction Control system (TC) 3 130 and Electron ic Stability Program (ESP®Plus) 3 131 ride control systems are not actively engaged ● the vehicle has moved since the last Autostop Otherwise an Autostop will be inhibited. Certain settings of the climate control system may inhibit an Autostop. See "Climate con trol Climate control" chapter for further information 3 105 Restart of the engine by the driver Restart of the engine by the driver Depress the clutch pedal to restart the engine. Note Note If any gear is selected, the clutch pedal must be fully depressed to restart t he engine. Control indicator Ï 3 90 extinguishes in the instrument cluster when the engine is resta rted. Restart of the engine by the stop- Restart of the engine by the stop- start system start system The selector lever must be in neutral to enable an automatic restart. If one of the following conditions occurs during an Autostop, the engine will be restart ed automatically by the stop-start system: ● the stop-start system is manually deactivated ● the bonnet is opened ● the vehicle battery is discharged ● the engine temperature is too low ● the brake vacuum is not sufficient ● the vehicle starts to move ● the defrosting function is activated 3 105 If an electrical accessory, e.g. 
a portable CD player, is connected to the power outlet, a brief power drop during engine restart may be noticeable.

where the title 'Restart of the engine by the stop-start system' cannot be found in the text because it does not match the text 'Restart of the engine by the stop- start system'?

JorjMcKie commented 6 years ago

Hm, everything looks fine to me (apart from what we discussed previously). Original page in a PDF reader:

[image]

Note that the physical page is 123 (1-based; 122 0-based), whereas the logical page is 121 (the document's own numbering scheme). This is the output from PyMuPDF:

[image]

You will find "Restart of the engine by the driver", corresponding to the bookmark (table of contents entry), in the middle column and at position 1065 of PyMuPDF's text output.

JorjMcKie commented 6 years ago

Text "Restart of the engine by the stop- start" is at position 1378.

Hugloss commented 6 years ago

Exactly: "Restart of the engine by the stop- start" is found in the text, but not "Restart of the engine by the stop-start system", which is the bookmark from getToc().

Maybe this is the same problem as before, but I thought that getText() printed everything as it is, so I should find all the bookmarks in the text. This bookmark I can't find because of the difference:

"stop-start" in bookmark
"stop- start" in text

JorjMcKie commented 6 years ago

No error there! Bookmark text is completely independent of other text; it could be anything. It is not generated from page text or the like, but entered by some other means. You could have headers in the text without corresponding bookmarks, and bookmark text without a counterpart in the text. Bookmarks could also point to other documents or to resources on the internet, etc.
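
For documents whose author did keep bookmarks and headings in sync, a tolerant comparison can still bridge differences such as "stop-start" versus "stop- start". The sketch below is an assumption-laden heuristic, not a PyMuPDF feature: the hypothetical helpers squash and title_in_text compare both strings after removing all whitespace and hyphens and ignoring case. It also assumes that duplicated words have already been removed from the page text, since the raw output of this PDF interleaves the repeated words.

```python
import re


def squash(s):
    """Remove whitespace, '-' and U+2010 hyphens, and ignore case."""
    return re.sub(r"[\s\-\u2010]+", "", s).casefold()


def title_in_text(title, text):
    """True if the title occurs in the text once both are squashed."""
    return squash(title) in squash(text)


bookmark = "Restart of the engine by the stop-start system"
page_text = (
    "Note Restart of the engine by the stop- start system "
    "The selector lever must be in neutral"
)

print(bookmark in page_text)               # exact substring search
print(title_in_text(bookmark, page_text))  # tolerant search
```

Because all whitespace is dropped, this can also produce false matches across word boundaries, so it is better suited to long titles than to single short words.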

Hugloss commented 6 years ago

Hi, Thank you for all your answers :) but I have one more to ask.

Have you thought about making it possible to get the text of all paragraphs (with title) for all pages? Just as you do page_text = doc[i].getText(), you could do:

page_outlines_text = doc[i].getParagraphText()
first_outline_text_for_page = page_outlines_text[0]

Is this a feature you thought of or think you will want to add?

JorjMcKie commented 6 years ago

I am not 100% sure I understand. Let me try another synopsis:

  1. Text shown on a page on the one hand and entries in a table of contents (TOC, resp. bookmarks, resp. outlines) on the other hand have absolutely nothing to do with each other. Whenever a relationship appears to exist, it was the author of the document who made it look that way. Even headlines for chapters or paragraphs are not connected to outlines. As per the internal PDF logic, there is not even such a thing as a "headline": it's just text, maybe with a different font or whatever.
  2. Therefore, there is no way to deduce anything from an outline / bookmark entry. It may or may not point to some place in the document. If it points inside the document, you cannot expect to find any text there either: it might be empty space, or a picture, or whatever.
  3. Every text extraction method (getText() with its 6 parameter variants, plus getTextWords() and getTextBlocks()) extracts the text from the page in a sequence which in general equals neither the reading sequence nor any other specific sequence. Only the geometry information (the bbox) tells you where the text piece is physically located on the page. It may well be the page footer text that appears first in the output.
  4. You can already combine the page-oriented text output in an obvious way. Our documentation also contains recipes on how to use this for creating a full HTML version of any document. You can find this code snippet in the Page chapter. I didn't bother to provide separate methods for doing this, just because it didn't seem to have enough priority. Memory consumption also needs to be considered: extracting all the text of a large document like Adobe's manual (1,310 pages!) at once may be a problem.

Of course you could create an iterator like in the following snippet

for text in getHTML(doc):
    # each 'text' contains the html text of one page
    ...

But it never occurred to me that this is much better than just using

for page in doc:
    text = page.getText("html")
    ...

I hope I understood what you meant ...