pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Bug in helpers/multi_column.py - IndexError: list index out of range #55

Closed shenyimings closed 3 months ago

shenyimings commented 3 months ago

I encountered an error while processing multi-column PDFs using the pymupdf4llm library. The error occurs in the helpers/multi_column.py file, specifically at line 254. Here is the traceback:

Traceback (most recent call last):
  File "/share/home/sh/code/llmx/analysis.py", line 133, in analyze_dir
    analyze_pdf(os.path.join(root, file), output_dir,config)
  File "/share/home/sh/code/llmx/analysis.py", line 42, in analyze_pdf
    doc = PDFDocumentProcessor(pdf_path, config=config, extension_name=ename)
  File "/share/home/sh/code/llmx/document.py", line 31, in __init__
    self.str_preprocess(pymupdf4llm.to_markdown(pdf_path))
  File "/share/home/sh/code/llmx/pymupdf4llm/helpers/pymupdf_rag.py", line 544, in to_markdown
    page_output, images, tables, graphics = get_page_output(
  File "/share/home/sh/code/llmx/pymupdf4llm/helpers/pymupdf_rag.py", line 499, in get_page_output
    text_rects = column_boxes(
  File "/share/home/sh/code/llmx/pymupdf4llm/helpers/multi_column.py", line 254, in column_boxes
    line0 = b["lines"][0]  # get first line
IndexError: list index out of range

The problematic line is:

line0 = b["lines"][0]  # get first line

This line raises an IndexError when b["lines"] is an empty list. Could you please modify the code to handle this case gracefully?
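For illustration, here is a minimal sketch of the kind of guard I have in mind. It is not the library's actual code or fix: `first_lines` is a hypothetical helper, and the sample blocks are made up, shaped loosely like the text blocks returned by `page.get_text("dict")`. The idea is simply to skip any block whose "lines" list is empty or missing before indexing into it.

```python
def first_lines(blocks):
    """Yield (block, first_line) pairs, skipping blocks that have no lines.

    Hypothetical helper for illustration only; not part of pymupdf4llm.
    """
    for b in blocks:
        lines = b.get("lines") or []  # treat a missing key like an empty list
        if not lines:
            continue  # nothing to inspect, so avoid b["lines"][0]
        yield b, lines[0]


# Made-up sample data: one block with no lines, one block with a single line.
blocks = [
    {"bbox": (0, 0, 10, 10), "lines": []},
    {"bbox": (0, 20, 10, 30), "lines": [{"dir": (1, 0), "spans": []}]},
]

# Only the second block survives; the empty one is skipped without an error.
kept = [line0 for _, line0 in first_lines(blocks)]
```

Whether empty blocks should be skipped or handled some other way inside column_boxes is a design decision for the maintainers; skipping is just the simplest option that avoids the exception.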

JorjMcKie commented 3 months ago

Solved in version 0.0.6.