pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Wrong fontsize calculation in corner cases ("page.get_texttrace()") #2703

Closed: JorjMcKie closed this 8 months ago

JorjMcKie commented 9 months ago

Discussed in https://github.com/pymupdf/PyMuPDF/discussions/2645

Originally posted by **sky884** September 7, 2023

I'm using texttrace to extract individual characters, their formatting, origin points and bounding boxes. This has been working well, but I've come across a problem with a particular PDF.

Texttrace shows a size value of 17.33 for the text in this PDF, but Acrobat displays the text at 12.99. Inspecting the text with PDFXplorer shows a size value of 17.33, matching texttrace, but also a scaling transformation of 0.75 (the CTM actually shows 0.75 0 0 -0.75). This perhaps explains the difference between 17.33 and 12.99, as 17.33 * 0.75 ≈ 12.99. Extracting text from the same PDF with get_text("rawdict") gives a size value of 12.99.

Is there a way in PyMuPDF to extract the CTM value applied to this text, so that I can recalculate 17.33 as 12.99? Or is there some other method of getting from the 17.33 that texttrace returns to the 12.99 value? I would prefer to use texttrace rather than get_text("rawdict") as it's faster, and it gives a spacewidth value which might help me calculate character spacing.

PyMuPDF is excellent, many thanks for developing such a great product.
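
The arithmetic in the post can be checked with a small sketch. It assumes the mismatch comes only from the scaling part of the matrix, and that the relevant factor is the length of the transformed vertical unit vector (the `c` and `d` entries). The helper `effective_font_size` is hypothetical, not part of PyMuPDF. The figures in the post are rounded, so the product agrees with 12.99 only approximately.

```python
import math


def effective_font_size(nominal_size, matrix):
    """Scale a nominal font size by the vertical scale of a 2x2 matrix.

    Hypothetical helper for illustration only, not a PyMuPDF API.
    `matrix` holds the first four entries (a, b, c, d) of a PDF matrix.
    """
    a, b, c, d = matrix[:4]
    # The vertical unit vector (0, 1) maps to (c, d); its length is the
    # factor by which glyph heights are scaled. The sign of d (a flipped
    # y-axis) does not affect the length.
    vertical_scale = math.hypot(c, d)
    return nominal_size * vertical_scale


# Values quoted in the post: size 17.33 and CTM 0.75 0 0 -0.75.
ctm = (0.75, 0.0, 0.0, -0.75)
size = effective_font_size(17.33, ctm)
print(f"effective size: {size:.4f}")
```

Under these assumptions the nominal 17.33 shrinks to roughly the value that Acrobat and get_text("rawdict") report.
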
julian-smith-artifex-com commented 8 months ago

Fixed in 1.23.5.