Closed HiroshigeAoki closed 1 month ago
Taking a look ...
Special treatment of text written in some mono-spaced font is a unique feature of this package. I would argue that it should either be kept completely or disabled completely - not just "code blocks yes/no".
When disabling code blocks, line breaks will no longer exist in the MD output - and thus look awkward in these cases! This is a direct and unavoidable consequence of treating such text in the same way as proportional fonts. In general, line breaks on document pages should not automatically lead to new lines in the MD text, because MD documents actually are a superset of HTML documents. And as such, line breaks are dynamically generated by the renderer / browser.
Apart from code blocks, the package never generates simple line breaks. It will instead use some criteria to start a new paragraph - in other words, insert a blank line, which is equivalent to using the HTML tag <p>.
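To make the line-break point concrete, here is a minimal sketch of how a Markdown renderer treats text outside of code blocks. It is a toy model, not code from this package, and the function name collapse_soft_breaks is made up for illustration: blank lines separate paragraphs, and single line breaks inside a paragraph are folded into spaces.

```python
import re


def collapse_soft_breaks(md_text: str) -> list[str]:
    """Toy model of paragraph rendering (hypothetical helper, not package code).

    Blank lines separate paragraphs; a single line break inside a paragraph
    is treated as a soft break and rendered as a space.
    """
    paragraphs = re.split(r"\n\s*\n", md_text.strip())
    return [
        " ".join(line.strip() for line in paragraph.splitlines())
        for paragraph in paragraphs
    ]


sample = "for i in range(3):\n    print(i)\n\nNext paragraph"
print(collapse_soft_breaks(sample))
```

Run on mono-spaced text that was not fenced, the two source lines end up on one line, which is the awkward output described above.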
Add enable_code_blocks Option to Control Code Block Formatting in Markdown Output
Summary
This PR introduces a new option, enable_code_blocks, to the to_markdown function in pymupdf_rag.py. The option lets users enable or disable code block formatting in the generated Markdown output. It defaults to False to maintain backward compatibility.
Changes
Adds the enable_code_blocks parameter to the to_markdown function.
Updates the write_text function to respect the enable_code_blocks setting.
Motivation
Sometimes, the automatic conversion of text to code blocks in the Markdown output can lead to formatting issues. This new option provides users with the flexibility to control whether code blocks should be included in the output, improving the usability of the tool in various scenarios.
Example
import pymupdf4llm
md_output = pymupdf4llm.to_markdown(pdf_path, enable_code_blocks=True)
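For reviewers, a rough sketch of how such a flag could gate the output for a run of mono-spaced lines. This is an assumption-laden illustration, not the actual write_text code in pymupdf_rag.py; the helper name format_mono_lines and its signature are hypothetical.

```python
def format_mono_lines(lines: list[str], enable_code_blocks: bool) -> str:
    """Hypothetical helper; not the real write_text from pymupdf_rag.py."""
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    if enable_code_blocks:
        # Fenced output keeps every original line break and indentation.
        return fence + "\n" + "\n".join(lines) + "\n" + fence + "\n\n"
    # Otherwise the lines are joined like ordinary text, and the paragraph
    # is closed with a blank line.
    return " ".join(line.strip() for line in lines) + "\n\n"


mono = ["for i in range(3):", "    print(i)"]
print(format_mono_lines(mono, enable_code_blocks=True))
print(format_mono_lines(mono, enable_code_blocks=False))
```

With the flag off, the original line structure is lost in the rendered output, which is the trade-off raised in the review comment.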