pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0
518 stars 81 forks

Feature request: inlined base64 images in markdown format #176

Closed sglebs closed 1 week ago

sglebs commented 3 weeks ago

I would like to suggest that extracted images could optionally be inlined in the markdown output as base64 data (controlled by a flag) instead of being written to external files. Here's the code:

import base64
# png_data: raw PNG bytes of the extracted image
b64_format = base64.b64encode(png_data)
markdown_format = f"![](data:image/png;base64,{b64_format.decode('utf-8')})"

If you can add the title & alt, even better. Not sure if you have these values in the PDF.
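To make the request concrete, here is a minimal stdlib-only sketch of the inline format with optional alt text and title. The helper name `inline_png_markdown` and the placeholder bytes are assumptions for illustration, not part of pymupdf4llm, and the sketch does not escape special characters in `alt` or `title`.

```python
import base64


def inline_png_markdown(png_data: bytes, alt: str = "", title: str = "") -> str:
    """Return a markdown image tag with the PNG embedded as a base64 data URI."""
    b64 = base64.b64encode(png_data).decode("ascii")
    # In markdown, an optional title follows the URL in double quotes.
    title_part = f' "{title}"' if title else ""
    return f"![{alt}](data:image/png;base64,{b64}{title_part})"


# Placeholder input: only the 8-byte PNG signature, not a complete image.
png_data = b"\x89PNG\r\n\x1a\n"
print(inline_png_markdown(png_data, alt="Figure 1", title="Sample"))
```

With real image bytes the data URI becomes long, so the resulting markdown file grows by roughly a third more than the size of the images it embeds.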

Thanks for listening, thanks for the lib, great job!

JorjMcKie commented 1 week ago

This is already supported via the embed_images parameter. Captions, titles, etc. cannot be detected systematically in a PDF and will therefore never be supported.
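For readers landing here, a minimal sketch of how the `embed_images` parameter might be used, assuming a pymupdf4llm release recent enough to include it. The wrapper name `pdf_to_markdown_inline` and the file name `input.pdf` are hypothetical.

```python
def pdf_to_markdown_inline(pdf_path: str) -> str:
    """Convert a PDF to markdown with images embedded as base64 data URIs."""
    # Imported lazily so the sketch loads even where the package is missing.
    import pymupdf4llm

    # embed_images=True asks for images to be inlined in the markdown text
    # rather than written to separate files (assumes a version with this flag).
    return pymupdf4llm.to_markdown(pdf_path, embed_images=True)


# Hypothetical usage:
# md_text = pdf_to_markdown_inline("input.pdf")
```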