pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Embedded hyperlink doesn't get extracted in markdown mode #153

Closed: tkcoding closed this issue 1 month ago

tkcoding commented 1 month ago

Description of the bug

I am trying to use pymupdf4llm to extract hyperlinks embedded in text, but most of the embedded links do not appear in the markdown output. Is there a way to extract the text together with its embedded hyperlinks?

How to reproduce the bug

Example file: example_document.pdf

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("example_document.pdf")

Screenshot of the expected outcome and the markdown currently produced:

[image]

Library versions: PyMuPDFb-1.24.10, pymupdf-1.24.10, pymupdf4llm-0.0.17

PyMuPDF version

1.24.10

Operating system

macOS

Python version

3.10

JorjMcKie commented 1 month ago

Links inside table cells are not supported yet.
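
Until table-cell links are supported, one possible workaround is to read the link annotations directly with PyMuPDF and pair each one with the text under its clickable rectangle. The following is only a sketch, not something pymupdf4llm provides: the helper names `markdown_link` and `collect_uri_links` are hypothetical, and it assumes the links in the document are ordinary URI links whose rectangle covers the anchor text.

```python
def markdown_link(text: str, uri: str) -> str:
    """Format anchor text and a target URI as an inline markdown link."""
    # Collapse line breaks and repeated spaces that text extraction may leave behind.
    label = " ".join(text.split())
    # Fall back to an autolink when no text was found under the link rectangle.
    return f"[{label}]({uri})" if label else f"<{uri}>"


def collect_uri_links(pdf_path: str) -> list[tuple[int, str, str]]:
    """Return (page_number, anchor_text, uri) for every URI link in the PDF.

    Hypothetical helper: it does not merge anything into to_markdown() output.
    """
    import pymupdf  # imported lazily so markdown_link() works without PyMuPDF

    found = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            for link in page.get_links():
                if link["kind"] != pymupdf.LINK_URI:
                    continue  # skip internal jumps and other non-URI link kinds
                # link["from"] is the clickable rectangle on the page.
                text = page.get_textbox(link["from"])
                found.append((page.number, text, link["uri"]))
    return found
```

Calling `collect_uri_links("example_document.pdf")` and passing each anchor text and URI to `markdown_link` yields markdown links that can then be substituted into the `to_markdown` output by hand. Matching them back to the right table cell is left open here, since the same anchor text may occur more than once on a page.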