When I was testing with the attached file, PyMuPDF extracted no text, so no font size was found, which then crashed the program with an IndexError. This change adds a default font size so the program can keep executing whether or not it finds text.
In [1]: import fitz
In [2]: from helpers.pymupdf_rag import to_markdown
In [3]: doc = fitz.open('XPS-table.pdf')
In [4]: md = to_markdown(doc)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[4], line 1
----> 1 md = to_markdown(doc)
File ~/projects/RAG/helpers/pymupdf_rag.py:238, in to_markdown(doc, pages)
235 code = False
236 return out_string.replace(" \n", "\n")
--> 238 hdr_prefix = IdentifyHeaders(doc, pages=pages)
239 md_string = ""
241 for pno in pages:
File ~/projects/RAG/helpers/pymupdf_rag.py:85, in to_markdown.<locals>.IdentifyHeaders.__init__(self, doc, pages, body_limit)
83 self.header_id = {}
84 if body_limit is None: # body text fontsize if not provided
---> 85 body_limit = sorted(
86 [(k, v) for k, v in fontsizes.items()],
87 key=lambda i: i[1],
88 reverse=True,
89 )[0][0]
91 sizes = sorted(
92 [f for f in fontsizes.keys() if f > body_limit], reverse=True
93 )
95 # make the header tag dictionary
IndexError: list index out of range
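The failure comes from indexing `[0][0]` into an empty list when `fontsizes` has no entries. A minimal sketch of the fallback idea follows. The helper name `pick_body_limit` and the value of `DEFAULT_BODY_LIMIT` are assumptions made for illustration, not the code or the value used in this change.

```python
from typing import Dict, Optional

# Assumed fallback value for illustration; the value chosen in the change may differ.
DEFAULT_BODY_LIMIT = 12


def pick_body_limit(
    fontsizes: Dict[int, int], body_limit: Optional[float] = None
) -> float:
    """Return the body text font size (hypothetical helper).

    `fontsizes` maps a font size to how often it occurs in the document.
    """
    if body_limit is not None:
        # The caller supplied a limit explicitly, so use it unchanged.
        return body_limit
    if not fontsizes:
        # No text was extracted, so there is nothing to rank: use the default
        # instead of indexing into an empty list.
        return DEFAULT_BODY_LIMIT
    # Same ranking as in the traceback: the most frequent font size wins.
    return sorted(fontsizes.items(), key=lambda i: i[1], reverse=True)[0][0]


if __name__ == "__main__":
    print(pick_body_limit({}))              # empty input falls back to the default
    print(pick_body_limit({10: 5, 14: 2}))  # most frequent size
```

With a guard like this, a document without extractable text gets no header sizes above the default, and conversion can proceed to the page loop.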
Attached file: XPS-table.pdf