pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Add Option to Control Code Block Formatting in Markdown Output. #98

Closed HiroshigeAoki closed 1 month ago

HiroshigeAoki commented 1 month ago

Add enable_code_blocks Option to Control Code Block Formatting in Markdown Output

Summary

This PR introduces a new option, enable_code_blocks, to the to_markdown function in pymupdf_rag.py. It lets users enable or disable code block formatting in the generated Markdown output. The option defaults to False to maintain backward compatibility.

Changes

  1. Code Changes:
    • Added enable_code_blocks parameter to the to_markdown function.
    • Updated the write_text function to respect the enable_code_blocks setting.
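As a rough illustration of the idea only (this is not the actual write_text code from pymupdf_rag.py), a flag like this could gate whether mono-spaced text is fenced. The helper name render_block and its is_monospaced argument are hypothetical and exist only for this sketch:

```python
FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def render_block(lines, is_monospaced, enable_code_blocks=False):
    """Render one block of text lines as Markdown.

    Hypothetical helper for illustration; not part of pymupdf4llm.
    """
    if is_monospaced and enable_code_blocks:
        # Fenced block: the original line breaks are preserved verbatim.
        return FENCE + "\n" + "\n".join(lines) + "\n" + FENCE + "\n\n"
    # Otherwise the lines are joined into a single ordinary paragraph.
    return " ".join(line.strip() for line in lines) + "\n\n"
```

With the flag off, mono-spaced lines are handled like any other text and lose their line breaks, which is the trade-off raised in the discussion.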

Motivation

Sometimes, the automatic conversion of text to code blocks in the Markdown output can lead to formatting issues. This new option provides users with the flexibility to control whether code blocks should be included in the output, improving the usability of the tool in various scenarios.

Example

import pymupdf4llm
md_output = pymupdf4llm.to_markdown(pdf_path, enable_code_blocks=True)

jamie-lemon commented 1 month ago

Taking a look ...

JorjMcKie commented 1 month ago

Special treatment of text written in a mono-spaced font is a unique feature of this package. I would argue that it should either be kept completely or disabled completely - not just "code blocks yes/no".

When code blocks are disabled, the line breaks of such text no longer exist in the MD output, and the result looks awkward in these cases! This is a direct and unavoidable consequence of treating such text in the same way as text in proportional fonts. In general, line breaks on document pages should not automatically lead to new lines in the MD text, because MD is effectively a superset of HTML. As such, line breaks are generated dynamically by the renderer / browser.

Apart from code blocks, the package never generates simple line breaks. Instead, it uses some criteria to decide when to start a new paragraph, in other words to insert a blank line, which is equivalent to using the HTML tag <p>.
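The point about blank lines and <p> can be shown with a very rough model of how a renderer treats plain paragraphs. This is a simplified sketch, not a real Markdown parser and not code from this package; the function name paragraphs_to_html is made up for the example:

```python
import re


def paragraphs_to_html(md_text):
    """Rough model of paragraph handling in a Markdown renderer.

    Illustration only: ignores headings, lists, code fences, etc.
    """
    # A blank line (possibly containing whitespace) separates paragraphs.
    chunks = re.split(r"\n\s*\n", md_text.strip())
    # Inside a paragraph, a single line break is just whitespace when displayed.
    return "\n".join("<p>" + " ".join(chunk.split()) + "</p>" for chunk in chunks)


print(paragraphs_to_html("first line\nsecond line\n\nnew paragraph"))
# <p>first line second line</p>
# <p>new paragraph</p>
```

The single line break between "first line" and "second line" disappears on display, while the blank line starts a new paragraph. That is why un-fenced mono-spaced text loses its line structure.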