pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Table text without new lines #19

Closed papipsycho closed 5 months ago

papipsycho commented 6 months ago

Hello,

I was testing this project and noticed an issue with new lines inside tables.

I looked into it a bit, and it seems to come from the to_markdown call in output_table. I suggest using .to_pandas().to_markdown() to keep the new lines inside a cell.

Best

JorjMcKie commented 5 months ago

Sorry, don't understand what that means. Please provide examples!

papipsycho commented 5 months ago

@JorjMcKie

Sorry for the delay, but here are some screenshots. With to_markdown() only: image_2024_05_27T05_29_37_928Z

With .to_pandas().to_markdown(): image_2024_05_27T05_30_37_311Z

As you can see, we lose the \n.

JorjMcKie commented 5 months ago

We will not use DataFrame.to_markdown(), because it introduces unnecessary external dependencies and also increases execution time. The markdown output is intended first and foremost to serve LLM / RAG applications with an input format that is easy to read yet contains the complete data. Its primary goal is not to create nice-looking markdown versions of input documents. In this context, losing line breaks in table cells (they are replaced with spaces) is an acceptable loss of appearance quality, not of information content.
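
To illustrate the trade-off described above, here is a minimal, dependency-free sketch of how a table cell containing line breaks might be flattened into a single markdown row. This is not the PyMuPDF implementation: the helper names `cell_to_markdown` and `rows_to_markdown` and the `keep_breaks` switch are hypothetical and exist only for this example. The `<br>` variant additionally assumes that whatever consumes the markdown accepts inline HTML.

```python
def cell_to_markdown(text, keep_breaks=False):
    """Flatten one table cell so it fits on a single markdown table row.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    if text is None:
        return ""
    # A raw "\n" would terminate the markdown row, so it must be replaced.
    # Default: a space (appearance changes, every word is kept).
    # Optional: "<br>" (assumes the consumer accepts inline HTML).
    replacement = "<br>" if keep_breaks else " "
    parts = [part.strip() for part in str(text).splitlines()]
    joined = replacement.join(part for part in parts if part)
    # Escape pipes so cell text cannot be mistaken for a column separator.
    return joined.replace("|", "\\|")


def rows_to_markdown(rows, keep_breaks=False):
    """Render a list of rows (first row = header) as a markdown table."""
    header, *body = rows
    lines = ["|" + "|".join(cell_to_markdown(c, keep_breaks) for c in header) + "|"]
    lines.append("|" + "|".join("---" for _ in header) + "|")
    for row in body:
        lines.append("|" + "|".join(cell_to_markdown(c, keep_breaks) for c in row) + "|")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [["Name", "Notes"], ["A", "line one\nline two"]]
    print(rows_to_markdown(sample))                    # line break becomes a space
    print(rows_to_markdown(sample, keep_breaks=True))  # line break becomes <br>
```

With the default setting the cell text `line one\nline two` becomes `line one line two`: the layout changes, but no words are lost, which is the point made in the comment above.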