pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

bug in to_markdown internal function #73

Closed by kingennio 2 months ago

kingennio commented 2 months ago

Hi, I think I spotted an insidious bug in get_page_output. Line 625 reads: md_string += output_images(None, tab_rects, None)

whereas I reckon it should be: md_string += output_images(page, None, vg_clusters)

I got incorrect results when an image on a page (typical of PDFs exported from PPTX) has no text below it. The extracted text didn't include the image tag, and the image wasn't saved to file. Changing the line above fixed the issue. Thank you for your code, BTW!
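One way to check for the symptom described above is to list the image references in the generated Markdown. The snippet below is only a minimal sketch: the helper name image_targets, the regular expression, and the sample string are illustrative assumptions and not part of pymupdf4llm.

```python
import re

# Matches Markdown image references such as ![](slide-0-0.png).
# The pattern is an assumption for illustration; it ignores titles and spaces in paths.
IMAGE_TAG = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_targets(md_text: str) -> list[str]:
    """Return the target path of every image reference found in md_text."""
    return IMAGE_TAG.findall(md_text)


# Hypothetical Markdown for a slide-like page: one image and no text below it.
sample = "# Slide 1\n\n![](slide-0-0.png)\n"
targets = image_targets(sample)
print(targets)
```

With pymupdf4llm installed, md_text could instead come from pymupdf4llm.to_markdown("slides.pdf", write_images=True), where "slides.pdf" is a placeholder file name. An empty list for a page known to contain an image would match the behavior reported here.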

JorjMcKie commented 2 months ago

Thanks for the report! You are quite right - the line should read md_string += output_images(page, None, img_rects). Will be fixed in the next version.

JorjMcKie commented 2 months ago

Fixed with v0.0.10.
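To confirm that an installed copy already includes the fix, the version can be compared against 0.0.10. This is a rough sketch: parse_version and has_fix are hypothetical helper names, and the parsing assumes a purely numeric dotted version string.

```python
from importlib.metadata import PackageNotFoundError, version

# First release that contains the fix, per this thread.
FIXED_IN = (0, 0, 10)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.0.10' into (0, 0, 10); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed: str) -> bool:
    """True if the given version string is at or above the fixed release."""
    return parse_version(installed) >= FIXED_IN


try:
    print("fix present:", has_fix(version("pymupdf4llm")))
except (PackageNotFoundError, ValueError):
    # Package missing, or a version string this simple parser cannot handle.
    print("could not determine the installed pymupdf4llm version")
```

Comparing integer tuples rather than raw strings matters here, because as plain strings "0.0.9" sorts after "0.0.10".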