font_family in page.get_text() dict at span level instead of font_name

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

https://pymupdf.readthedocs.io

GNU Affero General Public License v3.0

font_family in page.get_text() dict at span level instead of font_name #3546

Closed SirishaGorasa closed 4 weeks ago

SirishaGorasa commented 4 weeks ago

Description of the bug

The span object in page.get_text() has the font family instead of the font name. This can be a problem when trying to recreate the text, because the same PDF can contain different subset fonts under the same font family. Please share how we can get the original subset font name from get_text().

Page.get_fonts() reports the exact name, but the font associated with a span is only the font family.

How to reproduce the bug

Traverse the page.get_text() dict down to the span level: the font reported there is the font family rather than the original font name.

PyMuPDF version

1.23.x or earlier

Operating system

Windows

Python version

3.8

JorjMcKie commented 4 weeks ago

This method returns the font name! Using pymupdf.TOOLS.set_subset_fontnames(True) will return the subset prefix too.
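
A minimal sketch of this suggestion, for illustration only. It assumes a PyMuPDF release that imports as pymupdf and falls back to the older fitz module name otherwise. The helper names span_fonts and fonts_on_page, and the path argument, are hypothetical and not part of PyMuPDF.

```python
def span_fonts(text_dict):
    """Collect the distinct span["font"] values from a page.get_text("dict") result."""
    fonts = set()
    for block in text_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                fonts.add(span["font"])
    return fonts


def fonts_on_page(path, page_number=0):
    """Open a PDF (hypothetical path) and list its span fonts, subset prefixes included."""
    try:
        import pymupdf  # newer PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # older releases, e.g. 1.23.x

    # Ask PyMuPDF to keep subset prefixes such as "ABCDEF+" in font names.
    pymupdf.TOOLS.set_subset_fontnames(True)
    doc = pymupdf.open(path)
    try:
        return span_fonts(doc[page_number].get_text("dict"))
    finally:
        doc.close()
```

The traversal is kept separate from the file handling so that it can be tried on any dictionary shaped like the get_text("dict") output.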

JorjMcKie commented 4 weeks ago

BTW, please make sure to upgrade your Python version soon. Version 3.8 will no longer be supported beginning with some release in October. Ceasing support means we will no longer create wheels and will stop accepting issues.

SirishaGorasa commented 4 weeks ago

Sure, thanks for your quick response. I will check this and let you know.

SirishaGorasa commented 2 weeks ago

This worked, thanks! Can we get the encoding or the symbolic font name for each span? Different encodings can be defined for the same base font, so the symbolic font name would help in this case.

JorjMcKie commented 2 weeks ago

No, this is not possible. Fonts having identical names, down to even the subset prefix "ABCDEF+", cannot be differentiated.

SirishaGorasa commented 1 week ago

Can we get both the font name and the base font name from the span? For example, for a span I need both "font" : "Calibri" and "BaseFont" : "AFHYFG+Calibri".

JorjMcKie commented 1 week ago

Whether a font is a subset or not can be determined by checking for a prefix made of 6 uppercase characters followed by a "+". There is no other information available.
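
The prefix rule described here can be sketched in plain Python. This is only an illustration of the check. The function name split_subset_prefix is hypothetical, and the input is assumed to be a span font name obtained with subset names enabled.

```python
import re

# Six uppercase ASCII letters followed by "+", anchored at the start of the name.
_SUBSET_PREFIX = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_prefix(font_name):
    """Return (prefix, base_name); prefix is None when the name has no subset tag."""
    match = _SUBSET_PREFIX.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)
```

With this helper, split_subset_prefix("AFHYFG+Calibri") gives ("AFHYFG", "Calibri"), while a plain "Calibri" gives (None, "Calibri"), so both the family-style name and the prefixed name can be kept per span.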

SirishaGorasa commented 5 days ago

Is there a restriction on the number of characters in the subset font name? For example:

The internal structure had the following subset font name: /BaseFont /ABCDFG+TimesNewRomanPSMT-BoldCond. With TOOLS.set_subset_fontnames(True), span["font"] returned ABCDFG+TimesNewRomanPSMT-BoldCo.

The last two characters of the subset font name are missing.

Can you help me understand why this happened?

JorjMcKie commented 5 days ago

Yes, there is an in-built length restriction of 31 on the font name.
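
The arithmetic behind the example in this thread can be checked with a short sketch. It only mimics the observed effect with a string slice and says nothing about how PyMuPDF applies the limit internally. The function and constant names are made up for illustration.

```python
MAX_FONT_NAME_LEN = 31  # limit stated in this thread; the constant name is hypothetical


def truncate_font_name(name, limit=MAX_FONT_NAME_LEN):
    """Mimic the observed truncation by keeping only the first `limit` characters."""
    return name[:limit]


full_name = "ABCDFG+TimesNewRomanPSMT-BoldCond"  # /BaseFont value quoted in the thread
print(len(full_name))                 # 33
print(truncate_font_name(full_name))  # ABCDFG+TimesNewRomanPSMT-BoldCo
```

The full name has 33 characters, two more than the limit, which matches the two missing characters reported.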

SirishaGorasa commented 5 days ago

Oh, is that so?

So even if the base font name in the internal structure is longer than 31 characters, the name returned with set_subset_fontnames(True) is cut to 31 characters?

But what if we need the full-length base font name?

JorjMcKie commented 5 days ago

No way to do this - sorry.

SirishaGorasa commented 5 days ago

That's ok.

Appreciate your quick response.