Error when page contains nothing but a table

simonschoe commented 1 week ago

Hi there, when I create a Word document that contains nothing but a single table (e.g., 6 columns by 6 rows), fill it with dummy text, and save it as a PDF, to_markdown raises an IndexError if extract_words=True. This is the stack trace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[15], line 5
      2 from pymupdf4llm import to_markdown
      4 pdf_elements = pymupdf.open(stream=bytes, filetype="pdf")
----> 5 elements = to_markdown(
      6     pdf_elements,
      7     page_chunks=True,
      8     force_text=True,
      9     table_strategy='lines',
     10     show_progress=False,
     11     extract_words=True, # (x0, y0, x1, y1, "word", block_no, line_no, word_no)
     12     #**({'write_images': True, 'dpi': 300, 'image_path': PATH_IMAGES} if INCLUDE_IMAGES else {}),
     13 )
     15 [e["text"] for e in elements]

File ~\Lib\site-packages\pymupdf4llm\helpers\pymupdf_rag.py:907, in to_markdown(doc, pages, hdr_info, write_images, embed_images, image_path, image_format, image_size_limit, force_text, page_chunks, margins, dpi, page_width, page_height, table_strategy, graphics_limit, fontsize_limit, ignore_code, extract_words, show_progress)
    905     pages = ProgressBar(pages)
    906 for pno in pages:
--> 907     page_output, images, tables, graphics, words = get_page_output(
    908         doc, pno, margins, textflags
    909     )
    910     if page_chunks is False:
    911         document_output += page_output

File c:\Entwicklung\.venv_global\Lib\site-packages\pymupdf4llm\helpers\pymupdf_rag.py:881, in to_markdown.<locals>.get_page_output(doc, pno, margins, textflags)
    878             lwords.append(w)
    879     # append sorted words of this line
    880     # words.extend(sorted(lwords, key=lambda w: w[0]))
--> 881     words.extend(sort_words(lwords))
    883 # remove word duplicates without spoiling the sequence
    884 # duplicates may occur for multiple reasons
    885 nwords = []  # words w/o duplicates

File c:\Entwicklung\.venv_global\Lib\site-packages\pymupdf4llm\helpers\pymupdf_rag.py:708, in to_markdown.<locals>.sort_words(words)
    706 def sort_words(words):
    707     nwords = []
--> 708     line = [words[0]]
    709     lrect = pymupdf.Rect(words[0][:4])
    710     for w in words[1:]:

IndexError: list index out of range
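
To illustrate what the trace suggests: `sort_words` indexes `words[0]` unconditionally, so it fails whenever `lwords` arrives empty. Below is a minimal sketch of a guarded variant. The name `sort_words_safe` is hypothetical, the vertical-overlap test for "same line" is my own assumption, and none of this is the library's actual code or fix; it only shows the empty-list guard in context.

```python
def sort_words_safe(words):
    """Sort word tuples left to right within each visual line.

    Hypothetical stand-in for illustration only, not pymupdf4llm's code.
    Each word is a tuple starting with (x0, y0, x1, y1, text, ...).
    """
    if not words:
        # The guard: an empty list has no words[0], so return early.
        return []

    nwords = []
    line = [words[0]]
    y0, y1 = words[0][1], words[0][3]  # vertical extent of the current line
    for w in words[1:]:
        # Assumption: a word belongs to the current line if it overlaps
        # that line's vertical extent.
        if w[1] < y1 and w[3] > y0:
            line.append(w)
            y0, y1 = min(y0, w[1]), max(y1, w[3])
        else:
            # Flush the finished line, sorted by x0, and start a new one.
            nwords.extend(sorted(line, key=lambda t: t[0]))
            line = [w]
            y0, y1 = w[1], w[3]
    nwords.extend(sorted(line, key=lambda t: t[0]))
    return nwords
```

With this guard, a call with an empty list returns an empty list instead of raising.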

JorjMcKie commented 1 week ago

Let me have the example PDF please. No time to make one ...

brandenkmurray commented 1 week ago

@JorjMcKie

pymupdf4llm.to_markdown("./pdfs/oracle-annual-report-2021-22.pdf", extract_words=True, pages=[55])

oracle-annual-report-2021-22.pdf

simonschoe commented 1 week ago

Alternatively: tabel_page.pdf

JorjMcKie commented 1 week ago

Fixed in v0.0.17.
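
For anyone unsure whether their environment already has the fix, here is a small sketch that compares the installed version against 0.0.17. The helper names (`parse_version`, `has_fix`, `installed_has_fix`) are hypothetical, and the parsing assumes plain numeric versions such as `0.0.17`; a suffix like `rc1` would need a real version parser.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (0, 0, 17)  # release named in this thread as containing the fix


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints (assumes purely numeric parts)."""
    return tuple(int(part) for part in text.split(".")[:3])


def has_fix(installed):
    """Return True if the given version string is at least 0.0.17."""
    return parse_version(installed) >= FIXED_IN


def installed_has_fix():
    """Check the installed pymupdf4llm; return None if it is not installed."""
    try:
        return has_fix(version("pymupdf4llm"))
    except PackageNotFoundError:
        return None
```

If `installed_has_fix()` returns `False`, upgrading the package should pick up the fix.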

pymupdf / RAG

Error when page contains nothing but a table #147