pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

No module named 'get_text_lines' in pymupdf4llm #15

Closed vzegna closed 4 months ago

vzegna commented 4 months ago

Description of the bug

I am trying to convert a PDF to markdown and I keep getting this error:

    import pymupdf4llm
  File "/home/vzegna/pyvenv/lib/python3.11/site-packages/pymupdf4llm/__init__.py", line 1, in <module>
    from .helpers.pymupdf_rag import to_markdown, IdentifyHeaders
  File "/home/vzegna/pyvenv/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 46, in <module>
    from get_text_lines import get_raw_lines, is_white
ModuleNotFoundError: No module named 'get_text_lines'

This was working until a few days ago, but it stopped after I upgraded packages today. The pymupdf4llm version I currently have installed is 0.0.2. After downgrading to 0.0.1, the issue no longer appears. I assume someone might have inadvertently removed 'get_text_lines' from the code.

How to reproduce the bug

This is what I am doing in my Python code (I extracted the only relevant lines just for reference):

import pymupdf4llm
import pymupdf

pdf_doc = pymupdf.open("/my_pdf_file_path.pdf")
md_text = pymupdf4llm.to_markdown(pdf_doc)

The error is thrown when the code gets to the to_markdown() method.

PyMuPDF version

1.24.4

Operating system

Linux

Python version

3.11

a1ix2 commented 4 months ago

Had to uninstall the fitz package. It's some old package that hasn't been updated since Feb 2017.

Then I applied an ugly hack. First I added leading dots to the imports at lines 46-47 of site-packages/pymupdf4llm/helpers/pymupdf_rag.py, like so:

from .get_text_lines import get_raw_lines, is_white
from .multi_column import column_boxes

and then changed the first line in site-packages/pymupdf4llm/helpers/get_text_lines.py from import fitz to import pymupdf as fitz.
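To illustrate why the leading dot matters, here is a minimal sketch that builds a throwaway package and imports a module from it twice, once with each style of import. The package name `demo_pkg` and the module `rag.py` are hypothetical stand-ins; only `get_text_lines` and `is_white` are names taken from the traceback above. It assumes no top-level module called `get_text_lines` happens to be installed.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def _forget_demo_modules() -> None:
    """Remove any cached demo_pkg modules so each attempt starts clean."""
    for name in [m for m in sys.modules if m == "demo_pkg" or m.startswith("demo_pkg.")]:
        del sys.modules[name]


def try_import(import_line: str) -> str:
    """Build demo_pkg/helpers/ (hypothetical layout) and import helpers.rag.

    `import_line` is the single statement written into rag.py.
    Returns "ok" on success, or the ModuleNotFoundError text on failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        helpers = root / "demo_pkg" / "helpers"
        helpers.mkdir(parents=True)
        (root / "demo_pkg" / "__init__.py").write_text("")
        (helpers / "__init__.py").write_text("")
        (helpers / "get_text_lines.py").write_text(
            "def is_white(text):\n    return not text.strip()\n"
        )
        (helpers / "rag.py").write_text(import_line + "\n")

        # Only the package root is on sys.path, as with site-packages.
        sys.path.insert(0, str(root))
        _forget_demo_modules()
        importlib.invalidate_caches()
        try:
            importlib.import_module("demo_pkg.helpers.rag")
            return "ok"
        except ModuleNotFoundError as exc:
            return f"ModuleNotFoundError: {exc}"
        finally:
            sys.path.remove(str(root))
            _forget_demo_modules()


if __name__ == "__main__":
    # Absolute import: Python searches sys.path, not the helpers/ directory.
    print(try_import("from get_text_lines import is_white"))
    # Relative import: resolved against the current package (demo_pkg.helpers).
    print(try_import("from .get_text_lines import is_white"))
```

The absolute form only works when the `helpers` directory itself is on `sys.path`, for example when the module is run as a script from that directory, which may be why the problem did not show up before packaging.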

I am still getting an error when calling pymupdf4llm.to_markdown(fn):

File ~/miniconda3/envs/RAG/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py:356, in to_markdown.<locals>.output_images(text_rect, img_rects)
    351 if text_rect is not None:  # select tables above the text block
    352     for i, img_rect in sorted(
    353         [j for j in img_rects.items() if j[1].y1 <= text_rect.y0],
    354         key=lambda j: (j[1].y1, j[1].x0),
    355     ):
--> 356         pathname = save_image(page, img_rect, i)
    357         this_md += GRAPHICS_TEXT % (pathname, pathname)
    358         del img_rects[i]

NameError: name 'page' is not defined
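
The NameError above is the usual symptom of a nested function using a name that lives in a sibling function's scope rather than in its own enclosing scope. The sketch below reproduces that pattern in isolation and shows the fix of passing the value in explicitly. It is a simplified illustration, not the actual pymupdf4llm code: `broken_to_markdown`, `fixed_to_markdown`, and the one-line `save_image` are hypothetical stand-ins.

```python
def save_image(page, rect, index):
    """Hypothetical stand-in: just builds a file name from its inputs."""
    return f"{page}-{index}.png"


def broken_to_markdown(pages):
    def output_images(img_rects):
        # `page` is neither a parameter nor a variable of the enclosing
        # function, so the lookup falls through to globals and fails.
        return [save_image(page, rect, i) for i, rect in img_rects.items()]

    def get_page_output(page):
        # `page` is local to this sibling function only.
        return output_images({0: "rect"})

    return [get_page_output(p) for p in pages]


def fixed_to_markdown(pages):
    def output_images(page, img_rects):
        # `page` is now an explicit parameter.
        return [save_image(page, rect, i) for i, rect in img_rects.items()]

    def get_page_output(page):
        return output_images(page, {0: "rect"})

    return [get_page_output(p) for p in pages]


if __name__ == "__main__":
    try:
        broken_to_markdown(["p1"])
    except NameError as exc:
        print("broken:", exc)
    print("fixed:", fixed_to_markdown(["p1"]))
```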

jamie-lemon commented 4 months ago

It sounds to me like you may have had this "fitz" package installed: https://pypi.org/project/fitz/ and some things might have gone out of sync. I think if you do:

pip install pymupdf -U

and then:

pip install pymupdf4llm -U

Then it should bring you up to date okay. Your previous code:

import pymupdf4llm
import pymupdf

pdf_doc = pymupdf.open("/my_pdf_file_path.pdf")
md_text = pymupdf4llm.to_markdown(pdf_doc)

works as expected for me.
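If it is unclear whether the unrelated "fitz" distribution is present, a quick check along these lines may help. This is a minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_versions` and `has_stale_fitz` are made up for illustration. It looks up distribution names as pip knows them, so a result for "fitz" refers to the PyPI package linked above, not to PyMuPDF.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_stale_fitz(versions):
    """True if the separate 'fitz' distribution appears to be installed."""
    return versions.get("fitz") is not None


if __name__ == "__main__":
    found = installed_versions(["fitz", "pymupdf", "pymupdfb", "pymupdf4llm"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
    if has_stale_fitz(found):
        print("Consider running: pip uninstall fitz")
```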

a1ix2 commented 4 months ago

The above didn't work. I'm using pymupdf4llm-0.0.2, pymupdf-1.24.4, and pymupdfb-1.24.3. I looked at the tar.gz on PyPI and it contains the same problems. This is what I had to do, and now it works.

In get_text_lines.py

1c1
< import fitz
---
> import pymupdf as fitz

and in pymupdf_rag.py

46,47c46,47
< from get_text_lines import get_raw_lines, is_white
< from multi_column import column_boxes
---
> from .get_text_lines import get_raw_lines, is_white
> from .multi_column import column_boxes
166a167
>       page,
348c349
<     def output_images(text_rect, img_rects):
---
>     def output_images(page, text_rect, img_rects):
422c423
<             md_string += output_images(text_rect, vg_clusters)
---
>             md_string += output_images(page, text_rect, vg_clusters)
425a427
>               page,
437c439
<         md_string += output_images(None, tab_rects)
---
>         md_string += output_images(page, None, tab_rects)

JorjMcKie commented 4 months ago

This has been resolved in v0.0.3.