pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Stuck for multiple panel text PDFs #146

Closed: bbfrog closed this issue 1 month ago

bbfrog commented 1 month ago

Here are two example PDFs: IASLC.pdf ENA.pdf

Each has only 1 page, with multiple panels. When I tried pymupdf4llm.to_markdown, the progress showed 1/1, which looks like it is done, but the program got stuck from there. They may be hard cases; please let me know whether this can be fixed. Thanks!

JorjMcKie commented 1 month ago

Running this script works in both cases:

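import pymupdf4llm
import sys
import pathlib

# PDF file to convert, passed as the first command-line argument
filename = sys.argv[1]

# margins=0: consider the full page, with no margin area excluded
md = pymupdf4llm.to_markdown(filename, margins=0)

# write the Markdown text next to the input file, encoded as UTF-8
pathlib.Path(filename + ".md").write_bytes(md.encode())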

It takes a while to run, of course, because both pages contain more than 1000 drawings.
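To check up front whether a file is likely to be slow for this reason, one could count the vector drawings on each page before converting. The following is only a rough sketch: the helper names drawing_counts and heavy_pages are made up for illustration, and the threshold of 1000 is simply borrowed from the remark above, not a documented limit of pymupdf4llm.

```python
def drawing_counts(path):
    """Return the number of vector drawing objects on each page of a PDF."""
    # Imported lazily so heavy_pages() below can be used without PyMuPDF.
    import pymupdf

    with pymupdf.open(path) as doc:
        return [len(page.get_drawings()) for page in doc]


def heavy_pages(counts, threshold=1000):
    """Return 1-based page numbers whose drawing count exceeds the threshold.

    The default threshold is an assumption taken from the comment above,
    not a limit defined by PyMuPDF or pymupdf4llm.
    """
    return [number for number, n in enumerate(counts, start=1) if n > threshold]
```

Calling heavy_pages(drawing_counts("IASLC.pdf")) would then list the pages where a long conversion time should be expected. Note that get_drawings() itself has to walk all the vector graphics, so the check is not free either.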

bbfrog commented 1 month ago

Thanks very much!