pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Bounding boxes for extracted text #136

Closed (simonschoe closed 2 months ago)

simonschoe commented 2 months ago

@JorjMcKie Hi there, is there any chance that it will be possible in the future to obtain bounding boxes for the extracted text elements? That way, it would be possible to map the extracted text back onto the original PDF page, for example to visualize the chunk. This would be super helpful for end users. :)

jamie-lemon commented 2 months ago

I think you can do this with text_blocks = page.get_text("blocks"), see: https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_text
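
A minimal sketch of the suggestion above, for illustration only. It relies on the documented shape of the tuples returned by page.get_text("blocks"), namely (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 marks a text block and 1 an image block. The names TextBlock, to_text_blocks and extract_text_blocks are hypothetical helpers and not part of PyMuPDF or pymupdf4llm; the import name pymupdf assumes a recent PyMuPDF release (older releases are imported as fitz).

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical record pairing extracted text with its location."""
    page_number: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str


def to_text_blocks(page_number, raw_blocks):
    """Convert tuples from page.get_text("blocks") into TextBlock records.

    Each raw tuple is (x0, y0, x1, y1, text, block_no, block_type).
    Image blocks (block_type != 0) are skipped.
    """
    result = []
    for x0, y0, x1, y1, text, _block_no, block_type in raw_blocks:
        if block_type != 0:
            continue
        result.append(TextBlock(page_number, (x0, y0, x1, y1), text.strip()))
    return result


def extract_text_blocks(pdf_path):
    """Collect text blocks with bounding boxes for every page of a PDF."""
    import pymupdf  # assumption: recent PyMuPDF; older versions use `import fitz`

    blocks = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            raw = page.get_text("blocks")
            blocks.extend(to_text_blocks(page.number, raw))
    return blocks


if __name__ == "__main__":
    import sys

    # Usage: python blocks.py some.pdf
    for block in extract_text_blocks(sys.argv[1]):
        print(block.page_number, block.bbox, repr(block.text[:40]))
```

Keeping the bounding box next to the text means a chunk can later be mapped back onto its page, for example by drawing the rectangle with page.draw_rect(bbox). Note that this works on the raw block output, not on the Markdown produced by pymupdf4llm.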

JorjMcKie commented 2 months ago

I fully agree with @jamie-lemon's comment. Besides, this is not an issue but rather a Discussions item. Let's not bloat the Issues with plain questions!