Some of the text extracted from this PDF is duplicated on a single page (the first page).
pdf_doc.load_page(0).get_text(sort=True)
RESERVE BANK OF VANUATU\nRESERVE BANK OF VANUATU \n \nQUESTIONNAIRE FOR CONTROLLERS\nQUESTIONNAIRE FOR CONTROLLERS OF BANKS, CREDIT \n, CREDIT \nINSTITUTIONS AND ANY OTHER FINANCIAL INSTITUTIONS \nINSTITUTIONS AND ANY OTHER FINANCIAL INSTITUTIONS \nINSTITUTIONS AND ANY OTHER FINANCIAL INSTITUTIONS \nUNDER THE FIA \n(BEING A BODY CORPORATE)\n(BEING A BODY CORPORATE) \n \nNOTES FOR COMPLETION \nNOTES FOR COMPLETION\n \nA...
Unfortunately, the duplication is not exact. There seem to be two 'layers' of text, each with slightly different spans, and sometimes the spans are split differently. So a workaround that deduplicates by checking for exact span duplicates is not possible.
If you look at the PDF, it does look like it's copied twice, so maybe the first page somehow contains two pages?
>>> print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.8.10 (default, Mar 13 2023, 10:26:41)
[GCC 9.4.0]
linux
PyMuPDF 1.21.1: Python bindings for the MuPDF 1.21.1 library.
Version date: 2022-12-13 00:00:01.
Built for Python 3.8 on linux (64-bit).
This is no bug: as you have observed yourself, the text is stored like that in the file.
So all we can do is discuss options for dealing with cases like this.
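One possible option, as a rough sketch only: since the spans are split differently between the two copies, compare single characters instead of spans. The code below assumes that the second copy of each character is drawn at nearly the same origin as the first, which has not been verified for this file. The helper names `dedupe_chars` and `chars_from_rawdict` and the tolerance `tol` are made up for illustration and are not part of PyMuPDF; the only PyMuPDF feature relied on is the per-character output of `page.get_text("rawdict")`.

```python
from collections import defaultdict


def chars_from_rawdict(page_dict):
    """Flatten a dict shaped like page.get_text("rawdict") into (char, x, y) tuples."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line["spans"]:
                for ch in span["chars"]:
                    x, y = ch["origin"]
                    yield ch["c"], x, y


def dedupe_chars(chars, tol=1.0):
    """Keep the first occurrence of each character, dropping later copies whose
    origin lies within `tol` points (in both x and y) of an already kept copy
    of the same character. `tol` is a guess and needs tuning per document."""
    kept = []
    seen = defaultdict(list)  # character -> origins already kept
    for c, x, y in chars:
        if any(abs(x - sx) <= tol and abs(y - sy) <= tol for sx, sy in seen[c]):
            continue  # same glyph at (almost) the same place: treat as duplicate
        seen[c].append((x, y))
        kept.append((c, x, y))
    return kept


# Hypothetical usage with PyMuPDF (not executed here):
#   raw = pdf_doc.load_page(0).get_text("rawdict")
#   unique = dedupe_chars(chars_from_rawdict(raw), tol=1.0)
```

The surviving characters still have to be reassembled into lines and words, for example by sorting on their y and x coordinates. If the two copies turn out to be offset by more than a small tolerance, this approach will not catch them and the threshold has to be chosen by inspecting the file.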
Describe the bug (mandatory)
Possibly related issues: https://github.com/pymupdf/PyMuPDF/issues/379 and https://github.com/pymupdf/PyMuPDF/issues/218
This is one of many spans/blocks that are duplicated in the pdf attached. file.pdf
To Reproduce (mandatory)
Your configuration (mandatory)