pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Suggestion: possibility to have a callback on table #33

Closed: papipsycho closed this 3 months ago

papipsycho commented 4 months ago

Hello,

Sometimes tables are really complex to handle, especially with colspan or rowspan, so I suggest adding an event or callback that makes it possible to change the way the table is written.

@JorjMcKie let me know if you are interested in this; I can help with the integration of this feature.

JorjMcKie commented 4 months ago

Table identification as such is a PyMuPDF Page method. Complex table situations that HTML would express via colspan etc. cannot be represented that way. Instead, you must interpret cells whose text or bbox has the value None accordingly. We are not planning to change that.

We are using table.to_markdown() in this repo to output tables in text format. This inevitably restricts what can be expressed at all. The only conceivable alternative may be outputting tables instead (or additionally) as pandas DataFrames, as an option only. DataFrames will reflect the None values in cells, from which complex table structures can be derived.
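
For illustration, the kind of customization asked for above can be approximated in user code, without a callback in the library. The sketch below assumes the rows have the shape returned by PyMuPDF's `Table.extract()` (a list of rows, each a list of strings or None). The names `rows_to_markdown`, `fill_from_left` and the `transform` parameter are hypothetical helpers, not part of PyMuPDF or pymupdf4llm, and reading a None cell as "continuation of the cell to its left" is only one possible interpretation.

```python
from typing import Callable, List, Optional

Row = List[Optional[str]]


def fill_from_left(rows: List[Row]) -> List[List[str]]:
    """Hypothetical policy: treat a None cell as part of a horizontally
    merged cell and repeat the nearest non-None value to its left."""
    filled = []
    for row in rows:
        out, last = [], ""
        for cell in row:
            if cell is not None:
                last = cell
            out.append(last)
        filled.append(out)
    return filled


def _clean(cell: Optional[str]) -> str:
    # Markdown table cells cannot contain raw line breaks or unescaped pipes.
    if cell is None:
        return ""
    return str(cell).replace("\n", " ").replace("|", "\\|")


def rows_to_markdown(
    rows: List[Row],
    transform: Optional[Callable[[List[Row]], List[Row]]] = None,
) -> str:
    """Render rows as a Markdown table; the first row is used as the header.
    `transform` plays the role of the user-supplied callback."""
    if transform is not None:
        rows = transform(rows)
    cleaned = [[_clean(c) for c in row] for row in rows]
    header, body = cleaned[0], cleaned[1:]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


# Hand-written sample data; with PyMuPDF the rows would come from something
# like `page.find_tables().tables[0].extract()`.
rows = [["Region", "2023", None], ["North", "10", "12"]]
print(rows_to_markdown(rows, transform=fill_from_left))
```

Passing a different `transform` (or none at all) changes how None cells are rendered while leaving the table detection itself untouched.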