pymupdf / PyMuPDF

When extracting a numbered list, the result is not as expected. #3503

Closed wencan closed 4 months ago

wencan commented 4 months ago

Description of the bug

20245.pdf

The PDF contains a numbered list whose item numbers start with 0 (for example, 01). However, in the blocks returned by get_text('blocks'), each number appears on the line after its item text instead of in front of it.

Expected:

01 体重指数增⾼

Actual:

(59.562503814697266, 51.22419738769531, 151.6444549560547, 63.239990234375, '体重指数增⾼\n01\n', 1, 0)

[Screenshot from 2024-05-21 02-00-12]

How to reproduce the bug

import fitz  # PyMuPDF

doc = fitz.open("2024_5_.pdf")
for page in doc:
    # Each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
    blocks = page.get_text('blocks')
    for block in blocks:
        print(block)

PyMuPDF version

1.24.4

Operating system

Linux

Python version

3.12

JorjMcKie commented 4 months ago

This is no bug.

As explained in the documentation, naive text extraction always delivers text in the order in which it is stored in the page's appearance source. So in this case, the PDF creator first stored "体重指数增⾼", and then "01".

You have to develop code that also extracts the text coordinates and uses them to sort the text pieces accordingly.
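A minimal sketch of that approach, not taken from the PyMuPDF documentation: it assumes word tuples shaped like those returned by page.get_text("words"), i.e. (x0, y0, x1, y1, text, ...), groups words whose vertical centers lie within a tolerance into one line, and orders each line left to right. The helper name words_to_lines, the y_tol default, and the sample coordinates are illustrative assumptions, and a left-to-right script is assumed.

```python
from typing import List, Sequence


def words_to_lines(words: Sequence[tuple], y_tol: float = 3.0) -> List[str]:
    """Rebuild text lines from word tuples (x0, y0, x1, y1, text, ...).

    Hypothetical helper: words whose vertical centers differ by at most
    y_tol from the first word of a line are treated as the same line.
    """
    # Sort top-to-bottom by vertical center, then left-to-right.
    ordered = sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0]))

    lines = []  # each entry: (vertical center of first word, [words])
    for w in ordered:
        y_center = (w[1] + w[3]) / 2
        if lines and abs(y_center - lines[-1][0]) <= y_tol:
            lines[-1][1].append(w)
        else:
            lines.append((y_center, [w]))

    # Within each line, order the words by their left edge.
    return [
        " ".join(w[4] for w in sorted(line_words, key=lambda w: w[0]))
        for _, line_words in lines
    ]


# Made-up coordinates: the item text is stored before its number,
# but the number sits further left on the same visual line.
sample = [
    (60.0, 51.2, 151.6, 63.2, "text-A"),
    (40.0, 51.4, 55.0, 63.0, "01"),
    (60.0, 71.0, 150.0, 83.0, "text-B"),
    (40.0, 71.2, 55.0, 83.0, "02"),
]
print(words_to_lines(sample))

# With PyMuPDF, the input could come from the page itself:
#   for page in fitz.open("2024_5_.pdf"):
#       print(words_to_lines(page.get_text("words")))
```

The tolerance has to be tuned per document; a value that is too large merges neighbouring lines, and one that is too small splits words with slightly different baselines.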

But a solution for this problem already exists: install the package pymupdf4llm via python -m pip install pymupdf4llm, and then do this:

import pathlib
import pymupdf4llm
data = pymupdf4llm.to_markdown("2024_5_.pdf")
pathlib.Path("2024_5_.pdf.md").write_bytes(data.encode())

This extracts the text in Markdown format and in the correct reading sequence, and also determines header lines from the font sizes present in the document. The result is this: 20245.pdf.md

wencan commented 4 months ago

@JorjMcKie I suggest that the program's layout processing should be closer to human reading habits. In my reading habits, "01" is to the left of "number", not below it.

JorjMcKie commented 4 months ago

First of all, there is no such thing as universal "reading habits" - just think of Arabic / Persian / Hebrew and other right-to-left reading habits ... or even worse: documents that mix right-to-left with left-to-right text.

There are also writing systems that run top-to-bottom, character by character.

Above all, there are numerous situations where exact information is needed about the "physical" sequence of objects (including text objects), e.g. when it comes to determining which object covers which other one.
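To illustrate why the storage order matters, here is a small sketch in plain Python that does not use any PyMuPDF call. It assumes objects are painted in the order they are stored, so a later object lies on top of an earlier one wherever their rectangles intersect. The names overlaps and covering_pairs and the sample rectangles are made up for this example.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def covering_pairs(objects: List[Tuple[str, Rect]]) -> List[Tuple[str, str]]:
    """Return (upper, lower) pairs for objects given in storage order.

    Assumption: an object stored later is painted later, so it covers
    any earlier object that its rectangle intersects.
    """
    pairs = []
    for i, (lower_name, lower_rect) in enumerate(objects):
        for upper_name, upper_rect in objects[i + 1:]:
            if overlaps(lower_rect, upper_rect):
                pairs.append((upper_name, lower_name))
    return pairs


# Made-up page content in storage order.
page_objects = [
    ("image", (0.0, 0.0, 100.0, 100.0)),
    ("text", (10.0, 10.0, 50.0, 20.0)),
    ("stamp", (200.0, 200.0, 250.0, 250.0)),
]
print(covering_pairs(page_objects))
```

If extraction always re-sorted objects by position, this stacking information would no longer be recoverable from the output.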

Last but not least: I just showed you a way to extract text in a top-left to bottom-right sequence.

PyMuPDF lets you choose among multiple alternatives.