Closed FounderHy closed 5 days ago
This is not a bug. PyMuPDF and MuPDF can only report what the font says about itself. Properties such as bold, italic and serif are all taken from what the font declares. (Py)MuPDF has no way to check whether that information is true or a lie.
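To make "what the font says about itself" concrete: each span returned by `page.get_text("dict")` carries an integer `flags` field, and the bold / italic values are just bits of that integer. Below is a minimal sketch of decoding it, assuming the bit layout given in the PyMuPDF documentation (1 = superscript, 2 = italic, 4 = serifed, 8 = monospaced, 16 = bold). The helper name `decode_span_flags` is made up for illustration and is not part of the library.

```python
# Bit values as documented for the "flags" field of a PyMuPDF text span.
# The constant names below are local to this sketch.
FLAG_SUPERSCRIPT = 1
FLAG_ITALIC = 2
FLAG_SERIFED = 4
FLAG_MONOSPACED = 8
FLAG_BOLD = 16


def decode_span_flags(flags: int) -> dict:
    """Turn the integer span flags into named booleans.

    Hypothetical helper: it only reports what the font declared,
    so a visually bold glyph drawn with a "regular" font stays bold=False.
    """
    return {
        "superscript": bool(flags & FLAG_SUPERSCRIPT),
        "italic": bool(flags & FLAG_ITALIC),
        "serifed": bool(flags & FLAG_SERIFED),
        "monospaced": bool(flags & FLAG_MONOSPACED),
        "bold": bool(flags & FLAG_BOLD),
    }
```

For example, `decode_span_flags(18)` yields both `bold` and `italic` as true, while a flags value of 0 yields all properties false, no matter how the text looks on the page.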
There are several ways to simulate properties like bold, for example by drawing the same text twice with a small offset between the two copies. This is often done with Asian fonts because they are so large (usually many megabytes). PDF creators do not want to embed multiple versions of such a font in the same PDF (one each for regular, bold, italic and bold-italic), so they use tricks like the one described.
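If you need to guess at simulated bold yourself, one rough heuristic is to look for the same text drawn twice at almost the same position. The sketch below works on plain dicts shaped like PyMuPDF span dicts (only the `text` and `bbox` keys are used). The function name `find_overprinted_spans` and the `max_offset` threshold are assumptions for illustration. Whether the second copy shows up as a separate span at all depends on the PDF and on the MuPDF version, so treat this as a starting point, not a reliable detector.

```python
from itertools import combinations


def find_overprinted_spans(spans, max_offset=1.0):
    """Return pairs of spans that look like the same text drawn twice.

    spans: list of dicts with "text" and "bbox" (x0, y0, x1, y1) keys,
           e.g. collected from page.get_text("dict").
    max_offset: largest shift (in points) of the top-left corner that is
                still treated as "the same place"; an assumed default.
    """
    pairs = []
    for a, b in combinations(spans, 2):
        # Only identical, non-blank text can be a double print.
        if a["text"] != b["text"] or not a["text"].strip():
            continue
        dx = abs(a["bbox"][0] - b["bbox"][0])
        dy = abs(a["bbox"][1] - b["bbox"][1])
        if dx <= max_offset and dy <= max_offset:
            pairs.append((a, b))
    return pairs


# Made-up sample data, not taken from any real PDF.
sample = [
    {"text": "测试", "bbox": (10.0, 10.0, 30.0, 20.0)},
    {"text": "测试", "bbox": (10.3, 10.0, 30.3, 20.0)},  # shifted copy
    {"text": "abc", "bbox": (50.0, 10.0, 70.0, 20.0)},
]
suspects = find_overprinted_spans(sample)
```

The comparison is quadratic in the number of spans, which is fine for a single page. Text that is legitimately repeated on different lines is not flagged, because its positions differ by more than `max_offset`.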
Description of the bug
I extract spans from a PDF that contains Chinese text, English text and numbers. The flags (italic, bold, etc.) are right for the English text and the numbers, but the flags for the Chinese text are not right.
How to reproduce the bug
This is the test PDF: 样式.pdf
The extracted styles:
{'text': '这是一个测试文本', 'font': 'HYZhongHeiKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是一个测试文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是一个测试文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是一个测试文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '这是', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '一个测试', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '文本', 'font': 'HYShuSongErKW', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '12', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': '34', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
{'text': '5', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': 'This is a book', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-Bold', 'size': 10.449999809265137, 'bold': True, 'italic': False}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
{'text': 'This', 'font': 'HelveticaNeue', 'size': 10.449999809265137, 'bold': False, 'italic': False}
{'text': ' is a book', 'font': 'HelveticaNeue-BoldItalic', 'size': 10.449999809265137, 'bold': True, 'italic': True}
PyMuPDF version
1.24.5
Operating system
macOS
Python version
3.10