pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Linebreak inserted between each letter #3650

Closed: rezemika closed this issue 1 month ago

rezemika commented 3 months ago

Description of the bug

Hey, thank you so much for this amazing tool!

I am using PyMuPDF to parse many official French documents. Each contains a cover, a table of contents, and pages of scanned content. The vast majority are read without problems, but in a small number of them a line break is inserted between each letter of the text, making it almost unreadable.

Here are links to a few documents where this happens:

How to reproduce the bug

For instance, here is an example with the second mentioned document:

>>> import pymupdf
>>> f = "2023-04-28-ee04e9ccb016e7806a7cf92a48155834.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[0].get_text("blocks")
[
    (164.6999969482422, 377.63739013671875, 436.3139953613281, 394.6753845214844, 'R\nE\nC\nU\nE\nI\nL\n \nD\nE\nS\n \nA\nC\nT\nE\nS\n \nA\nD\nMI\nN\nI\nS\nT\nR\nA\nT\nI\nF\nS\n', 0, 0),
    (225.0, 531.0374145507812, 376.00396728515625, 548.0614013671875, 'n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n', 1, 0)
]

>>> pymupdf.version
('1.24.7', '1.24.4', '20240626000001')
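On affected versions, a possible stopgap is to post-process the extracted block text. The following is a minimal sketch, not part of PyMuPDF: the function name `collapse_letter_per_line` and its thresholds are hypothetical, and it assumes a block is "broken" when it has several lines that are all one or two characters long (the output above contains one two-character piece, `MI`). It may misfire on legitimately narrow columns, so treat it as a heuristic only.

```python
def collapse_letter_per_line(text, max_piece_len=2, min_pieces=4):
    """Join a block's text if it looks like one glyph per line.

    Hypothetical helper: thresholds are guesses based on the sample output,
    not values taken from PyMuPDF or MuPDF.
    """
    pieces = text.split("\n")
    # Block text ends with "\n", which leaves a trailing empty piece.
    if pieces and pieces[-1] == "":
        pieces.pop()
    looks_broken = (
        len(pieces) >= min_pieces
        and all(len(piece) <= max_piece_len for piece in pieces)
    )
    # Leave normal text untouched; only collapse suspicious blocks.
    return "".join(pieces) if looks_broken else text


def fix_blocks(blocks):
    """Apply the heuristic to tuples shaped like get_text("blocks") items."""
    return [
        (x0, y0, x1, y1, collapse_letter_per_line(text), block_no, block_type)
        for x0, y0, x1, y1, text, block_no, block_type in blocks
    ]
```

Applied to the two blocks shown above, this yields `RECUEIL DES ACTES ADMINISTRATIFS` and `n° 77 du 28 avril 2023`; the bounding boxes and block numbers are passed through unchanged.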

And here is its first page as I see it:

[Image: cover of the second mentioned document]

Please let me know if I can provide any further information!

PS: Is there any "debugging tool" that would allow you to view text and content blocks as they're seen by PyMuPDF for easier analysis?
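For a quick textual view of the blocks, one option is to summarize the tuples that `get_text("blocks")` returns. This is only a sketch under assumptions: `describe_blocks` is a hypothetical helper, not a PyMuPDF function, and it relies on the documented tuple layout `(x0, y0, x1, y1, text, block_no, block_type)`, where block type 0 is text and 1 is image.

```python
def describe_blocks(blocks):
    """Return one summary line per block tuple (hypothetical helper)."""
    kinds = {0: "text", 1: "image"}
    summaries = []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        kind = kinds.get(block_type, "unknown")
        # Each line of a block ends with "\n", so this counts its lines.
        n_lines = text.count("\n")
        summaries.append(
            f"#{block_no} {kind} "
            f"bbox=({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}) "
            f"lines={n_lines} text={text!r}"
        )
    return summaries


# Typical use, assuming `doc` is an open pymupdf.Document:
#     for line in describe_blocks(doc[0].get_text("blocks")):
#         print(line)
```

A block with an unusually high line count relative to its text length stands out immediately in this listing, which is the symptom reported here.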

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.11

JorjMcKie commented 3 months ago

This is a MuPDF problem, which I will transfer to their issue tracker. Attached: test.pdf

MuPDF issue link: https://bugs.ghostscript.com/show_bug.cgi?id=707859

julian-smith-artifex-com commented 1 month ago

Fixed in 1.24.10.