pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

flags of span is not right for Chinese text #3623

Closed FounderHy closed 5 days ago

FounderHy commented 5 days ago

Description of the bug

I extract spans from a PDF that contains Chinese text, English text and numbers. The flags (italic, bold, etc.) are right for the English text and numbers, but the flags for the Chinese text are not right.

How to reproduce the bug

This is the test PDF: 样式.pdf

The extracted styles:

{'text': '这是一个测试文本', 'font': 'HYZhongHeiKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是一个测试文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是一个测试文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是一个测试文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '一个测试', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '12', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '34', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
{'text': '5', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': 'This is a book', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-Bold', 'size': 10.449999809265137, 'bold': True, 'italic': False}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}

PyMuPDF version

1.24.5

Operating system

MacOS

Python version

3.10

JorjMcKie commented 5 days ago

This is not a bug. PyMuPDF and MuPDF can only report what the font says about itself. All the properties bold, italic, serif etc. are based on responses from the font. For (Py)MuPDF there is no way to check whether this is true or a lie.
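
For readers who want to see where these properties come from: the reporter's script is not shown in this thread, so the following is only a minimal sketch of how such a listing might be produced. It assumes the documented bit layout of a span's `flags` value (1 superscript, 2 italic, 4 serifed, 8 monospaced, 16 bold). The helper names `decode_flags` and `span_styles` are hypothetical, not part of PyMuPDF.

```python
# Bit values of a span's "flags" integer, as documented for PyMuPDF text spans.
SUPERSCRIPT = 1 << 0
ITALIC = 1 << 1
SERIFED = 1 << 2
MONOSPACED = 1 << 3
BOLD = 1 << 4


def decode_flags(flags):
    """Return the style properties encoded in a span's flags integer.

    Hypothetical helper: it only unpacks what the font reports about itself.
    """
    return {
        "superscript": bool(flags & SUPERSCRIPT),
        "italic": bool(flags & ITALIC),
        "serifed": bool(flags & SERIFED),
        "monospaced": bool(flags & MONOSPACED),
        "bold": bool(flags & BOLD),
    }


def span_styles(pdf_path):
    """Yield text, font, size and decoded bold/italic for every span in a PDF."""
    import pymupdf  # imported lazily; older installs may need "import fitz"

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        style = decode_flags(span["flags"])
                        yield {
                            "text": span["text"],
                            "font": span["font"],
                            "size": span["size"],
                            "bold": style["bold"],
                            "italic": style["italic"],
                        }
```

Calling `list(span_styles("样式.pdf"))` would give one dictionary per span. A font that does not declare itself bold or italic yields `False` for both, however the text looks when rendered.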

There are multiple ways to simulate properties like bold, for example by outputting the same text twice with a small offset in between. This is often done with Asian fonts because they are so large (usually many megabytes). PDF creators do not want to include multiple font versions in the same PDF (one each for regular, bold, italic and bold-italic), so they use tricks like the one described.
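
If simulated bold needs to be detected anyway, one possible heuristic is to look for spans with identical text whose bounding boxes almost coincide. The sketch below is an assumption-laden illustration, not a PyMuPDF feature: `find_overprinted_spans` and its `max_offset` threshold are hypothetical, and whether both copies of the text actually show up as separate spans depends on the PDF and on the extraction settings.

```python
def find_overprinted_spans(spans, max_offset=1.0):
    """Return index pairs of spans that look like the same text printed twice.

    `spans` is a list of dicts with "text" and "bbox" (x0, y0, x1, y1) keys,
    the shape of the span dicts returned by page.get_text("dict").
    `max_offset` is a hypothetical tolerance in points, not a library setting.
    """
    pairs = []
    for i, first in enumerate(spans):
        if not first["text"].strip():
            continue  # ignore whitespace-only spans
        for j in range(i + 1, len(spans)):
            second = spans[j]
            if first["text"] != second["text"]:
                continue
            # Largest shift of any box coordinate between the two copies.
            shift = max(abs(a - b) for a, b in zip(first["bbox"], second["bbox"]))
            if shift <= max_offset:
                pairs.append((i, j))
    return pairs


# Synthetic example data (not taken from the PDF in this issue).
example = [
    {"text": "这是", "bbox": (10.0, 10.0, 30.0, 20.0)},
    {"text": "这是", "bbox": (10.3, 10.0, 30.3, 20.0)},    # shifted copy
    {"text": "这是", "bbox": (100.0, 10.0, 120.0, 20.0)},  # same text elsewhere
]
overprinted = find_overprinted_spans(example)
```

The pairwise comparison is quadratic in the number of spans, which is usually acceptable for a single page. Other ways of simulating bold would not be caught by this check.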