Closed: vzegna closed this issue 4 months ago
Had to uninstall the fitz package. It's some old package that hasn't been updated since Feb 2017.
Then I applied an ugly hack. First I added leading dots to the imports at lines 46-47 in site-packages/pymupdf4llm/helpers/pymupdf_rag.py,
like so:
from .get_text_lines import get_raw_lines, is_white
from .multi_column import column_boxes
and then changed the first line in site-packages/pymupdf4llm/helpers/get_text_lines.py from
import fitz
to
import pymupdf as fitz
I'm still getting an error when calling pymupdf4llm.to_markdown(fn):
File ~/miniconda3/envs/RAG/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py:356, in to_markdown.<locals>.output_images(text_rect, img_rects)
351 if text_rect is not None: # select tables above the text block
352 for i, img_rect in sorted(
353 [j for j in img_rects.items() if j[1].y1 <= text_rect.y0],
354 key=lambda j: (j[1].y1, j[1].x0),
355 ):
--> 356 pathname = save_image(page, img_rect, i)
357 this_md += GRAPHICS_TEXT % (pathname, pathname)
358 del img_rects[i]
NameError: name 'page' is not defined
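For readers unfamiliar with this failure mode, here is a minimal sketch of how a nested helper can raise this kind of NameError. The function names only mirror the traceback; the bodies are hypothetical and are not the library's code. It assumes `page` is local to a sibling helper rather than to the enclosing function, so the helper cannot see it unless it is passed in.

```python
def to_markdown_broken(pages):
    """Hypothetical stand-in: the helper reads `page`, which is not in its scope."""

    def output_images(text_rect):
        # `page` is neither a parameter nor a variable of an enclosing function,
        # so Python looks it up as a global at call time and fails.
        return f"image on {page} near {text_rect}"

    def get_page_output(page):
        # `page` is local to this function only; sibling helpers cannot see it.
        return output_images("rect")

    return [get_page_output(p) for p in pages]


def to_markdown_fixed(pages):
    """Same structure, but `page` is passed to the helper explicitly."""

    def output_images(page, text_rect):
        return f"image on {page} near {text_rect}"

    def get_page_output(page):
        return output_images(page, "rect")

    return [get_page_output(p) for p in pages]


try:
    to_markdown_broken(["page-1"])
except NameError as exc:
    error_message = str(exc)

fixed_result = to_markdown_fixed(["page-1"])
```

Passing the value as an explicit parameter keeps the helper independent of where it happens to be defined.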
It sounds to me like you may have had this "fitz" installed: https://pypi.org/project/fitz/ and some things might have gone out of sync. I think if you do:
pip install pymupdf -U
and then:
pip install pymupdf4llm -U
Then it should bring you up to date okay. Your previous code:
import pymupdf4llm
import pymupdf
pdf_doc = pymupdf.open("/my_pdf_file_path.pdf")
md_text = pymupdf4llm.to_markdown(pdf_doc)
works as expected for me.
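To see which of these packages are actually installed in the active environment, a small check along these lines may help. It is a sketch that uses only the standard library's importlib.metadata; the helper names `installed_version` and `report` are made up for illustration. A version reported for "fitz" would suggest the unrelated PyPI package is still present.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(names=("fitz", "pymupdf", "pymupdfb", "pymupdf4llm")):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that runs the notebook or script, since each environment has its own site-packages.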
The above didn't work. I'm using pymupdf4llm-0.0.2, pymupdf-1.24.4, and pymupdfb-1.24.3. I looked at the tar.gz on PyPI and it contains the same problems. This is what I had to do, and now it works.
In get_text_lines.py
1c1
< import fitz
---
> import pymupdf as fitz
and in pymupdf_rag.py
46,47c46,47
< from get_text_lines import get_raw_lines, is_white
< from multi_column import column_boxes
---
> from .get_text_lines import get_raw_lines, is_white
> from .multi_column import column_boxes
166a167
> page,
348c349
< def output_images(text_rect, img_rects):
---
> def output_images(page, text_rect, img_rects):
422c423
< md_string += output_images(text_rect, vg_clusters)
---
> md_string += output_images(page, text_rect, vg_clusters)
425a427
> page,
437c439
< md_string += output_images(None, tab_rects)
---
> md_string += output_images(page, None, tab_rects)
This has been resolved in v0.0.3.
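As background on the leading-dot change to the imports in the diff above, the sketch below builds two throwaway packages in a temporary directory and shows the general Python 3 behavior: a module inside a package cannot import a sibling module by its bare name, but can with a relative import. All package and module names here (`demo_abs`, `demo_rel`, `demo_sibling_mod`, `main_mod`) are hypothetical and stand in for the real files.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root, relative):
    """Create a throwaway package whose module imports a sibling module."""
    pkg_name = "demo_rel" if relative else "demo_abs"
    helpers = Path(root) / pkg_name / "helpers"
    helpers.mkdir(parents=True)
    (helpers.parent / "__init__.py").write_text("")
    (helpers / "__init__.py").write_text("")
    (helpers / "demo_sibling_mod.py").write_text("VALUE = 42\n")
    # The only difference between the two packages is the leading dot.
    prefix = "." if relative else ""
    (helpers / "main_mod.py").write_text(
        f"from {prefix}demo_sibling_mod import VALUE\n"
    )
    return f"{pkg_name}.helpers.main_mod"


with tempfile.TemporaryDirectory() as root:
    sys.path.insert(0, root)
    try:
        abs_name = build_demo_package(root, relative=False)
        rel_name = build_demo_package(root, relative=True)
        importlib.invalidate_caches()
        try:
            importlib.import_module(abs_name)
            absolute_import_ok = True
        except ModuleNotFoundError:
            # The bare name is searched on sys.path, not inside the package.
            absolute_import_ok = False
        relative_value = importlib.import_module(rel_name).VALUE
    finally:
        sys.path.remove(root)
```

The bare-name form only works when the helpers directory itself is on sys.path, which is typically true when running a file from that directory but not when the package is installed in site-packages.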
Description of the bug
I am trying to convert a PDF to markdown and I keep getting this error:
This was working until a few days ago, but it stopped after I upgraded packages today. The pymupdf4llm version I currently have installed is 0.0.2. I downgraded to 0.0.1, and the issue no longer appears. I assume someone might have inadvertently removed 'get_text_lines' from the code.
How to reproduce the bug
This is what I am doing in my Python code (I extracted only the relevant lines for reference):
The error is thrown when the code gets to the to_markdown() method.
PyMuPDF version
1.24.4
Operating system
Linux
Python version
3.11