pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

No text for some pages of a pdf file #154

Open nmakhotkin opened 1 month ago

nmakhotkin commented 1 month ago

With the PDF attached below, pymupdf4llm.to_markdown outputs no text until page 8, yet the same pages can be extracted fine (without formatting etc.) using pymupdf directly.

68478448.pdf

import pymupdf4llm
import pymupdf

text = pymupdf4llm.to_markdown(
    '68478448.pdf',
    margins=(0, 15, 0, 15),
    write_images=True,
    show_progress=True,
)
pymupdf_text = pymupdf.open('68478448.pdf').get_page_text(0)

print('DÉNOMINATION DU MÉDICAMENT' in text)  # False
print('DÉNOMINATION DU MÉDICAMENT' in pymupdf_text)  # True

The value of text here starts with "rare (≥ 1/10 000 à < 1/1 000)", which appears only on the 8th page of the file.
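To narrow down which pages are affected, a per-page comparison may help. The following is only a rough sketch: the helper names `pages_missing_text` and `compare_extractions` are made up for illustration, and it assumes that `to_markdown(..., page_chunks=True)` returns one dict per page, in page order, with the page's Markdown under the "text" key.

```python
def pages_missing_text(markdown_pages, plain_pages):
    """Return 1-based page numbers where the Markdown output is empty
    although plain extraction found text on the same page."""
    missing = []
    pairs = zip(markdown_pages, plain_pages)
    for number, (md, plain) in enumerate(pairs, start=1):
        if plain.strip() and not md.strip():
            missing.append(number)
    return missing


def compare_extractions(path):
    """Hypothetical helper: list pages that pymupdf4llm leaves empty."""
    # Imported here so pages_missing_text can be used on its own.
    import pymupdf
    import pymupdf4llm

    # Assumption: page_chunks=True yields one dict per page, in order.
    chunks = pymupdf4llm.to_markdown(
        path,
        margins=(0, 15, 0, 15),
        page_chunks=True,
    )
    markdown_pages = [chunk["text"] for chunk in chunks]

    # Plain text per page, extracted with pymupdf directly.
    with pymupdf.open(path) as doc:
        plain_pages = [page.get_text() for page in doc]

    return pages_missing_text(markdown_pages, plain_pages)
```

Calling `compare_extractions('68478448.pdf')` should then list the pages for which only pymupdf returns text, which can be compared between pymupdf4llm versions.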

UPD: It worked fine with version 0.0.16.

inacionery commented 1 day ago

+1