pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

PyMuPDF not catching character `𠮟` #1793

Closed: vinhtq115 closed this issue 2 years ago

vinhtq115 commented 2 years ago

Describe the bug (mandatory)

For some reason, PyMuPDF does not extract the character 𠮟 (U+20B9F).

To Reproduce (mandatory)

Example files: Archive.zip

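import fitz  # PyMuPDF

# Alternative sample from the archive (same problem on page index 127):
# doc = fitz.Document('aozorabunko_43768.pdf')
# page = doc[127]

doc = fitz.Document('aozorabunko_04508.pdf')
page = doc[197]  # zero-based page index

# Extract the page as plain text
results = page.get_text('text')

# Print 'Found' once per occurrence of the character
for c in results:
    if c == '𠮟':
        print('Found')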

Running this script prints nothing, so the character is missing from the text that PyMuPDF extracts. Firefox, however, recognizes the character when I open the same file.
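To narrow this down, it can help to look at which code points were extracted in place of the expected character. The following is only a minimal sketch: the helper names `code_point` and `suspect_chars` are made up for illustration and are not part of PyMuPDF. It flags replacement characters (U+FFFD), private-use code points and unassigned code points, which are typical of text that could not be mapped back to Unicode.

```python
import unicodedata

TARGET = "\U00020B9F"  # the character 𠮟 from this report


def code_point(ch: str) -> str:
    """Format a single character as 'U+XXXX'."""
    return f"U+{ord(ch):04X}"


def suspect_chars(text: str):
    """Return (index, code point, category) for characters that look unmapped.

    A character is flagged if it is U+FFFD (replacement character),
    a private-use code point (category 'Co') or unassigned ('Cn').
    """
    suspects = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in ("Co", "Cn"):
            suspects.append((index, code_point(ch), category))
    return suspects
```

Passing the `results` string from the script to `suspect_chars(results)` would list such positions, and `TARGET in results` is a shorter way to test for the character itself.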

Your configuration (mandatory)

JorjMcKie commented 2 years ago

Adobe Acrobat, XPDF-Reader and others don't find it either. So this is not specific to (Py)MuPDF and is not a bug.

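Note: Being able to see a character in a viewer says nothing about whether it can be extracted. Rendering only needs the glyph's shape from the font, whereas extraction needs a mapping from that glyph back to a Unicode code point, and a PDF can lack that mapping or carry a wrong one.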
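One way to see what the extraction layer actually delivers is to list the extracted code points per font. The sketch below assumes the dictionary layout returned by `page.get_text("rawdict")` (blocks, lines, spans, and a `chars` list whose entries carry the character under the key `c`). The helper names `iter_chars`, `chars_by_font` and `load_rawdict` are hypothetical, and the sketch does not claim to explain why these particular files behave as they do.

```python
def iter_chars(raw: dict):
    """Yield (font name, character) pairs from a rawdict-style page dictionary."""
    for block in raw.get("blocks", []):
        # Image blocks carry no "lines" entry, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for char in span.get("chars", []):
                    yield span.get("font", ""), char["c"]


def chars_by_font(raw: dict) -> dict:
    """Group the extracted code points (as 'U+XXXX' strings) by font name."""
    grouped = {}
    for font, ch in iter_chars(raw):
        grouped.setdefault(font, []).append(f"U+{ord(ch):04X}")
    return grouped


def load_rawdict(path: str, page_number: int) -> dict:
    """Read one page's rawdict with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF; only needed when a real file is read

    with fitz.open(path) as doc:
        return doc[page_number].get_text("rawdict")
```

For the file in this report, `chars_by_font(load_rawdict('aozorabunko_04508.pdf', 197))` would show which code points each font contributes, which makes it easier to spot a glyph that is drawn on the page but arrives as a different or unmapped code point.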