pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

pymupdf4llm markdown function missing first and last line on every page #151

Closed Devvarat closed 1 month ago

Devvarat commented 1 month ago

When I call pymupdf4llm.to_markdown(filepath), the to_markdown() function is missing the first and last line on every page. The underlying pymupdf get_text() function works fine and returns the complete page text.

JorjMcKie commented 1 month ago

What did you use as the margins parameter?

Devvarat commented 1 month ago

I did not set any margins. Can you recommend a specific margins value that might work?

I am using pymupdf4llm to read PDF files.

Devvarat commented 1 month ago

Hi JorjMcKie,

Thanks for the reply. Adjusting the margins parameter solved our issue. I am converting resume PDFs to Markdown, and sometimes a resume covers the whole page, so setting margins=0 worked for me.

Thanks

JorjMcKie commented 1 month ago

Please consult the documentation: the default margins value is (0, 50, 0, 50), so stripes of height 50 are ignored at top and bottom of each page. Use margins=0 in your case.
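For readers who land on this thread, here is a minimal sketch of the fix discussed above. It assumes the margins value is read as (left, top, right, bottom) in points, that a single number applies to all four sides, and that a pair means (top, bottom); that ordering is my reading of the documentation, so check it against your installed version. The helpers normalize_margins, kept_region and to_markdown_full_page are hypothetical names used only for illustration and are not part of pymupdf4llm. The only library call is pymupdf4llm.to_markdown(path, margins=0), as recommended in the replies.

```python
def normalize_margins(margins):
    """Expand a margins value into (left, top, right, bottom).

    Assumed convention (verify against the pymupdf4llm docs):
    a single number applies to all sides, a pair is (top, bottom).
    """
    if isinstance(margins, (int, float)):
        return (float(margins),) * 4
    values = tuple(float(m) for m in margins)
    if len(values) == 2:
        top, bottom = values
        return (0.0, top, 0.0, bottom)
    if len(values) == 4:
        return values
    raise ValueError("margins must be a number or a sequence of 2 or 4 numbers")


def kept_region(page_width, page_height, margins=(0, 50, 0, 50)):
    """Return (x0, y0, x1, y1) of the page area that is NOT ignored.

    Coordinates are in points with the origin at the top-left corner.
    The default mirrors the (0, 50, 0, 50) value quoted in this thread.
    """
    left, top, right, bottom = normalize_margins(margins)
    return (left, top, page_width - right, page_height - bottom)


def to_markdown_full_page(path):
    """Convert a PDF to Markdown without ignoring any page border."""
    import pymupdf4llm  # third-party; imported lazily so the helpers above run without it

    return pymupdf4llm.to_markdown(path, margins=0)


if __name__ == "__main__":
    # On an A4 page (595 x 842 points) the quoted default skips a 50-point
    # stripe at the top and bottom; margins=0 keeps the whole page.
    print(kept_region(595, 842))
    print(kept_region(595, 842, margins=0))
```

Text that starts within the top or bottom stripe, such as the first line of a resume that runs to the page edge, falls outside the kept region, which matches the missing first and last lines reported here.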