pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Embedded links inside the table are not extracted #42

Open narsandu opened 2 months ago

narsandu commented 2 months ago

When extracting data from a PDF table with embedded links, only the text is captured, not the actual links.

JorjMcKie commented 2 months ago

This is not a bug! It is a feature that may be implemented at some later point. It would have to be implemented in the table module of PyMuPDF, which makes it complicated, because both the actual link target and the display text would have to be taken into account. A reasonable decision would probably be to fall back to HTML syntax for doing this ...
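Until something like that exists, a possible workaround is to match link rectangles against table cell rectangles yourself. The following is only a sketch of that idea, not how the table module does or will do it. The helper names (`rects_intersect`, `links_per_cell`, `table_links`) are made up for illustration. It assumes a PyMuPDF version recent enough to provide `import pymupdf` and `page.find_tables()`; older versions use `import fitz`.

```python
def rects_intersect(a, b):
    """Return True if two (x0, y0, x1, y1) rectangles overlap with positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def links_per_cell(cell_boxes, links):
    """Map cell index -> list of URIs whose link rectangle overlaps that cell.

    cell_boxes: sequence of (x0, y0, x1, y1) boxes; None entries are skipped.
    links: dicts as returned by page.get_links(), i.e. with a "from"
           rectangle and, for external links, a "uri" string.
    """
    result = {}
    for i, cell in enumerate(cell_boxes):
        if cell is None:  # defensive: skip missing cells
            continue
        uris = [
            link["uri"]
            for link in links
            if link.get("uri") and rects_intersect(cell, link["from"])
        ]
        if uris:
            result[i] = uris
    return result


def table_links(pdf_path):
    """Yield (page number, table bbox, {cell index: [uris]}) for every table found."""
    import pymupdf  # imported here so the helpers above work without PyMuPDF

    doc = pymupdf.open(pdf_path)
    for page in doc:
        links = page.get_links()
        for table in page.find_tables().tables:
            yield page.number, table.bbox, links_per_cell(table.cells, links)
```

The cell indices refer to positions in `table.cells`, so mapping them back to rows and columns of `table.extract()` is left open here. Internal links (those without a `"uri"` key) are ignored by this sketch.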

There is a similar request, #21, for doing the same with images; maybe you want to take a look.