Issue with text extraction near footer of page

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0


Issue with text extraction near footer of page #68

Closed Shreyanshcodes closed 2 months ago

Shreyanshcodes commented 2 months ago

I was trying to extract text from a PDF using pymupdf4llm. Most of the extracted text is good, but I am facing some issues with text near the footer. Attaching a screenshot for reference:

JorjMcKie commented 2 months ago

Please provide a reproducing file. Just providing images is no help. But maybe you simply did not notice that the package assumes top and bottom margins of 50 points, which causes anything overlapping these areas to be ignored. Best try again with the parameter margins=0.
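As a rough illustration of the margin behaviour described above, the sketch below checks whether a text block would touch a 50-point top or bottom margin. The helper `overlaps_margin`, the wrapper `to_markdown_without_margins`, and the sample coordinates are hypothetical and are not taken from the package's source; only the `margins=0` argument to `pymupdf4llm.to_markdown` comes from this thread.

```python
def overlaps_margin(bbox, page_height, top=50, bottom=50):
    """Return True if a text block touches the top or bottom margin area.

    bbox is (x0, y0, x1, y1) in points, with y growing downwards from the
    top of the page, as in PyMuPDF. Hypothetical helper, not package code.
    """
    _, y0, _, y1 = bbox
    return y0 < top or y1 > page_height - bottom


def to_markdown_without_margins(pdf_path):
    """Convert a PDF to Markdown with margins disabled, as suggested above."""
    import pymupdf4llm  # imported lazily; requires `pip install pymupdf4llm`
    return pymupdf4llm.to_markdown(pdf_path, margins=0)


# Sample footer line on an A4 page (842 points high); coordinates are made up.
footer_block = (72, 800, 520, 815)
print(overlaps_margin(footer_block, page_height=842))                   # True
print(overlaps_margin(footer_block, page_height=842, top=0, bottom=0))  # False
```

With the default margins the sample footer line would be dropped, while zero margins keep it, which matches the behaviour reported in this thread.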

Shreyanshcodes commented 2 months ago

carmanual_1.pdf carmanual_2.pdf

Hi @JorjMcKie, I did try changing the footer margins earlier, but that did not work. Setting the margins to 0 works, however. The file "carmanual_1" is the one I have tested, and I sent a screenshot of it earlier. Currently, I am facing the following issues:

For "carmanual_1", the text extraction is fine, but the reading order of the parsed output is incorrect: it does not follow the multi-column layout and reads the text in a different order.

For "carmanual_2", the last 5 pages do not yield any extracted text.

JorjMcKie commented 2 months ago

This is actually a new issue, or two. Please be aware that this package does not do an a priori layout analysis before extracting text. It is not powered by AI, nor does it use OCR or anything of the kind.

On the contrary: The location of text portions is used to derive the overall layout. This implies that the algorithm can be confused if text is written in a manner deviating too much from any reading order.
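To illustrate how position-based ordering can go wrong, here is a toy heuristic that groups text blocks into columns by their horizontal position and then reads each column top to bottom. It is hypothetical and is not pymupdf4llm's actual algorithm; the function name `reading_order`, the `column_gap` threshold, and the sample blocks are all assumptions made for this sketch.

```python
def reading_order(blocks, column_gap=20):
    """Order (x0, y0, x1, y1, text) blocks column by column, top to bottom.

    Toy heuristic: a new column starts whenever a block's left edge lies
    more than column_gap points to the right of the current column's
    right edge.
    """
    columns = []  # each entry: [right_edge, list_of_blocks]
    for block in sorted(blocks, key=lambda b: b[0]):  # left to right
        x0, _, x1, _, _ = block
        if columns and x0 <= columns[-1][0] + column_gap:
            columns[-1][1].append(block)
            columns[-1][0] = max(columns[-1][0], x1)
        else:
            columns.append([x1, [block]])
    ordered = []
    for _, members in columns:
        ordered.extend(sorted(members, key=lambda b: b[1]))  # top to bottom
    return [b[4] for b in ordered]


two_columns = [
    (50, 100, 280, 200, "A"),
    (320, 100, 550, 200, "C"),
    (50, 220, 280, 300, "B"),
    (320, 220, 550, 300, "D"),
]
print(reading_order(two_columns))  # ['A', 'B', 'C', 'D']

# A full-width block bridges the gap, so both columns merge into one.
with_banner = two_columns + [(50, 400, 550, 430, "E")]
print(reading_order(with_banner))  # ['A', 'C', 'B', 'D', 'E']
```

In the second call a single full-width block is enough to interleave the two columns, which is the kind of confusion a purely position-based approach can run into.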

Another issue is that the package currently does not cope with rotated pages. This is the case for your second file. In a future version, we will detect and remove page rotations.

Shreyanshcodes commented 2 months ago

Okay, thanks @JorjMcKie!

Shreyanshcodes commented 2 months ago

For my second PDF, none of the pages are rotated. What do you mean by rotation?

JorjMcKie commented 2 months ago

This:

import pymupdf  # PyMuPDF

doc = pymupdf.open("carmanual_2.pdf")
for page in doc:
    # page.rotation is the page's rotation in degrees
    print(f"{page.number=} has {page.rotation=}")

page.number=0 has page.rotation=90
page.number=1 has page.rotation=0
page.number=2 has page.rotation=0
page.number=3 has page.rotation=0
page.number=4 has page.rotation=0
page.number=5 has page.rotation=0
page.number=6 has page.rotation=90
page.number=7 has page.rotation=90
page.number=8 has page.rotation=90
page.number=9 has page.rotation=90

Shreyanshcodes commented 2 months ago

But page 0 is being extracted even though it has rotation 90. Why not the last 4?

JorjMcKie commented 2 months ago

No - it is not. In my version the "Warranty" page does not appear.

JorjMcKie commented 2 months ago

There are utility scripts here which remove page rotations without impacting visibility. You can treat that PDF with one of them and make another try.
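A minimal sketch of what such a treatment might look like with PyMuPDF itself, assuming a PyMuPDF version that provides `Page.remove_rotation()`. This is not one of the utility scripts mentioned above; the function names `remove_rotations` and `derotate_file` are hypothetical.

```python
def remove_rotations(doc):
    """Reset every rotated page to rotation 0 and return the affected page numbers.

    doc is any iterable of page objects exposing .number, .rotation and
    .remove_rotation(), such as an opened PyMuPDF document.
    """
    changed = []
    for page in doc:
        if page.rotation != 0:
            page.remove_rotation()  # PyMuPDF: sets rotation to 0, keeps appearance
            changed.append(page.number)
    return changed


def derotate_file(src_path, dst_path):
    """Open src_path, remove page rotations, and save the result to dst_path."""
    import pymupdf  # imported lazily; only needed when working on a real file
    doc = pymupdf.open(src_path)
    changed = remove_rotations(doc)
    doc.save(dst_path)
    doc.close()
    return changed
```

For the file in this thread, one might call `derotate_file("carmanual_2.pdf", "carmanual_2_derotated.pdf")` (the output filename is made up) and then run the extraction on the new file.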

JorjMcKie commented 2 months ago

Fixed in v0.0.9.