pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Duplicated text #2318

Closed witzatom closed 1 year ago

witzatom commented 1 year ago

Describe the bug (mandatory)

Some of the text extracted from this PDF is duplicated on a single page (the first page).

pdf_doc.load_page(0).get_text(sort=True)
RESERVE BANK OF VANUATU\nRESERVE BANK OF VANUATU \n \nQUESTIONNAIRE FOR CONTROLLERS\nQUESTIONNAIRE FOR CONTROLLERS OF BANKS, CREDIT \n, CREDIT \nINSTITUTIONS AND ANY OTHER FINANCIAL INSTITUTIONS \nINSTITUTIONS AND ANY OTHER FINANCIAL INSTITUTIONS \nINSTITUTIONS AND ANY OTHER FINANCIAL INSTITUTIONS \nUNDER THE FIA \n(BEING A BODY CORPORATE)\n(BEING A BODY CORPORATE) \n \nNOTES FOR COMPLETION \nNOTES FOR COMPLETION\n \nA...

Unfortunately, the text is not exactly duplicated. There seem to be two 'layers', each with slightly different spans, and sometimes the spans are split differently. So a workaround that deduplicates by checking for exact span duplicates is not possible.

If you look at the PDF, it does look like it's copied twice, so maybe the issue is that there are somehow 2 pages on the first page?

Possibly related issues: https://github.com/pymupdf/PyMuPDF/issues/379 and https://github.com/pymupdf/PyMuPDF/issues/218

This is one of many spans/blocks that are duplicated in the attached PDF: file.pdf

{'bbox': (208.79991149902344,
          40.227928161621094,
          403.0159912109375,
          55.07319641113281),
 'lines': [{'bbox': (208.79991149902344,
                     40.227928161621094,
                     403.0159912109375,
                     55.07319641113281),
            'dir': (1.0, 0.0),
            'spans': [{'ascender': 0.86181640625,
                       'bbox': (208.79991149902344,
                                40.227928161621094,
                                403.0159912109375,
                                55.07319641113281),
                       'color': 0,
                       'descender': -0.26318359375,
                       'flags': 20,
                       'font': 'Garamond,Bold',
                       'origin': (208.79991149902344, 51.60028076171875),
                       'size': 13.185470581054688,
                       'text': 'RESERVE BANK OF VANUATU'}],
            'wmode': 0}],
 'number': 3,
 'type': 0}

{'bbox': (208.79991149902344,
          40.227928161621094,
          406.0125732421875,
          55.07319641113281),
 'lines': [{'bbox': (208.79991149902344,
                     40.227928161621094,
                     406.0125732421875,
                     55.07319641113281),
            'dir': (1.0, 0.0),
            'spans': [{'ascender': 0.86181640625,
                       'bbox': (208.79991149902344,
                                40.227928161621094,
                                406.0125732421875,
                                55.07319641113281),
                       'color': 0,
                       'descender': -0.26318359375,
                       'flags': 20,
                       'font': 'Garamond,Bold',
                       'origin': (208.79991149902344, 51.60028076171875),
                       'size': 13.183370590209961,
                       'text': 'RESERVE BANK OF VANUATU '}],
            'wmode': 0}],
 'number': 21,
 'type': 0}

To Reproduce (mandatory)

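import fitz  # PyMuPDF
from pprint import pprint

pdf_doc = fitz.open("file.pdf")
# Extract the first page as a dict of blocks, sorted top-left to bottom-right.
blocks = pdf_doc.load_page(0).get_text("dict", flags=fitz.TEXT_INHIBIT_SPACES, sort=True)['blocks']

# Print the first two blocks after sorting (the near-duplicates shown above).
pprint(blocks[0])
pprint(blocks[1])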

Your configuration (mandatory)

>>> print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.8.10 (default, Mar 13 2023, 10:26:41) 
[GCC 9.4.0] 
 linux 

PyMuPDF 1.21.1: Python bindings for the MuPDF 1.21.1 library.
Version date: 2022-12-13 00:00:01.
Built for Python 3.8 on linux (64-bit).

JorjMcKie commented 1 year ago

This is not a bug: as you have observed yourself, the text is stored like that in the file. So all we can do is discuss options for dealing with cases like this.
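
One possible option, as a minimal sketch and not an official PyMuPDF recipe: since the two 'layers' split their spans differently, deduplicate at character level instead of span level. The sketch assumes the page was extracted with get_text("rawdict"), where every span carries a "chars" list whose items have a "c" (the character) and an "origin" key. The helper names dedupe_chars and page_chars and the tolerance of 1.0 point are assumptions made for illustration; the tolerance in particular would need tuning per document.

```python
def dedupe_chars(chars, tol=1.0):
    """Drop characters that repeat an already-kept character at nearly the same origin.

    chars: list of dicts with keys "c" and "origin", as found in the
    "chars" lists of a get_text("rawdict") result.
    tol: assumed maximum distance (in points) between duplicate origins.
    """
    seen = {}  # character -> list of origins already kept
    kept = []
    for ch in chars:
        x, y = ch["origin"]
        origins = seen.setdefault(ch["c"], [])
        if any(abs(x - ox) <= tol and abs(y - oy) <= tol for ox, oy in origins):
            continue  # same glyph at (almost) the same place: treat as duplicate
        origins.append((x, y))
        kept.append(ch)
    return kept


def page_chars(page_dict):
    """Flatten a "rawdict" page dictionary into a list of character dicts."""
    return [
        ch
        for block in page_dict["blocks"]
        if block["type"] == 0  # text blocks only, skip image blocks
        for line in block["lines"]
        for span in line["spans"]
        for ch in span["chars"]
    ]


# Synthetic data standing in for two overlapping layers (not taken from file.pdf).
sample = [
    {"c": "A", "origin": (10.0, 50.0)},
    {"c": "B", "origin": (18.0, 50.0)},
    {"c": "A", "origin": (10.2, 50.1)},  # second layer, slightly shifted
    {"c": "B", "origin": (18.3, 50.0)},  # second layer, slightly shifted
    {"c": "A", "origin": (40.0, 50.0)},  # genuinely different position
]
print("".join(ch["c"] for ch in dedupe_chars(sample)))  # ABA
```

On a real page this would be called as dedupe_chars(page_chars(page.get_text("rawdict"))). The comparison is quadratic in the number of identical characters, which is acceptable for a single page, and it does not remove characters that exist in only one layer, such as the trailing space in the second span above. Lines and words would have to be rebuilt from the surviving characters afterwards.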