PyMuPDF not catching character `𠮟`

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

GNU Affero General Public License v3.0

Describe the bug (mandatory)

PyMuPDF's text extraction does not return the character `𠮟` (U+20B9F), even though it appears on the page.

To Reproduce (mandatory)

Example files: Archive.zip

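import fitz  # PyMuPDF

# Other sample file from the archive:
# doc = fitz.Document('aozorabunko_43768.pdf')
# page = doc[127]

doc = fitz.Document('aozorabunko_04508.pdf')
page = doc[197]  # 0-based page index

# Extract the page content as plain text
results = page.get_text('text')

# Print 'Found' once for every occurrence of the character
for c in results:
    if c == '𠮟':
        print('Found')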

Running this script prints nothing, because the extracted text does not contain the character. Firefox recognizes the same character when I open the file there.
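To narrow down what the extraction returns in place of the character, the following is a minimal diagnostic sketch rather than a confirmed explanation. It works on any string, such as the value of `page.get_text('text')`. The helper name `inspect_text` and the three checks are my own assumptions about what might be worth looking at: whether `𠮟` is present, whether the similar-looking BMP character `叱` (U+53F1) or the replacement character U+FFFD shows up instead, and which code points outside the Basic Multilingual Plane the text contains at all.

```python
TARGET = "\U00020B9F"      # 𠮟, a code point outside the Basic Multilingual Plane
RELATED = "\u53F1"         # 叱, similar-looking BMP character (assumed candidate substitute)
REPLACEMENT = "\uFFFD"     # Unicode replacement character (assumed candidate substitute)


def inspect_text(text, target=TARGET):
    """Summarize which of the candidate characters occur in `text` (hypothetical helper)."""
    return {
        # True if the target character itself was extracted
        "target_found": target in text,
        # How often the similar-looking BMP character occurs
        "related_count": text.count(RELATED),
        # How often the replacement character occurs
        "replacement_count": text.count(REPLACEMENT),
        # All distinct code points above U+FFFF, formatted as U+XXXXX
        "non_bmp": sorted({f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF}),
    }


if __name__ == "__main__":
    # With PyMuPDF this would be: print(inspect_text(page.get_text('text')))
    print(inspect_text("sample " + TARGET + " text"))
```

If `target_found` is false while `related_count` or `replacement_count` is nonzero on the affected page, that would suggest the character is being mapped to something else during extraction rather than dropped entirely.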

Your configuration (mandatory)

Ubuntu 20.04.4 LTS
Python 3.8.13
PyMuPDF version 1.20.1 (wheel)

PyMuPDF not catching character `𠮟` #1793
