pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Not all PDFs have fontsizes #3

Closed. bartdegoede closed this 5 months ago

bartdegoede commented 5 months ago

When I tested with the attached file, PyMuPDF didn't extract any text, so no font size was found, which then crashed the program with an IndexError. This change adds a default font size so the program can keep executing whether or not it finds text.

In [1]: import fitz

In [2]: from helpers.pymupdf_rag import to_markdown

In [3]: doc = fitz.open('XPS-table.pdf')

In [4]: md = to_markdown(doc)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 md = to_markdown(doc)

File ~/projects/RAG/helpers/pymupdf_rag.py:238, in to_markdown(doc, pages)
    235         code = False
    236     return out_string.replace(" \n", "\n")
--> 238 hdr_prefix = IdentifyHeaders(doc, pages=pages)
    239 md_string = ""
    241 for pno in pages:

File ~/projects/RAG/helpers/pymupdf_rag.py:85, in to_markdown.<locals>.IdentifyHeaders.__init__(self, doc, pages, body_limit)
     83 self.header_id = {}
     84 if body_limit is None:  # body text fontsize if not provided
---> 85     body_limit = sorted(
     86         [(k, v) for k, v in fontsizes.items()],
     87         key=lambda i: i[1],
     88         reverse=True,
     89     )[0][0]
     91 sizes = sorted(
     92     [f for f in fontsizes.keys() if f > body_limit], reverse=True
     93 )
     95 # make the header tag dictionary

IndexError: list index out of range
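
For illustration, here is a minimal sketch of the kind of guard described above. It is not the actual patch. It assumes `fontsizes` is a dict mapping each font size to a usage count, as the traceback suggests; the helper name `pick_body_limit` and the fallback value of 12 are hypothetical.

```python
# Hypothetical fallback value; the real default used by the patch may differ.
DEFAULT_BODY_FONTSIZE = 12


def pick_body_limit(fontsizes, default=DEFAULT_BODY_FONTSIZE):
    """Return the most frequently used font size, or `default` if none were found.

    `fontsizes` is assumed to map font size -> usage count.
    """
    if not fontsizes:
        # No text was extracted, so there is nothing to rank:
        # fall back instead of indexing into an empty list.
        return default
    # Same result as sorting by count in descending order and taking the first key.
    return max(fontsizes.items(), key=lambda item: item[1])[0]


# A PDF with no extractable text yields an empty dict and no crash.
print(pick_body_limit({}))
# A normal document: the size with the highest count wins.
print(pick_body_limit({10.0: 500, 14.0: 20}))
```

With a guard like this, the empty case returns the fallback rather than raising, and the later `f > body_limit` comparison simply produces no header sizes.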

XPS-table.pdf

JorjMcKie commented 5 months ago

Thanks for spotting this! I have fixed the problem.