matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License

Formatting inside code spans #141

Closed jonatanschroeder closed 4 weeks ago

jonatanschroeder commented 1 month ago

Consider the following construct:

<code>normal <strong>bold</strong></code>

This is valid HTML, and a browser renders "bold" in bold inside the code span. markdownify drops the strong tag, though:

>>> from markdownify import markdownify as md
>>> md('<code>normal <strong>bold</strong></code>')
'`normal bold`'

chrispy-snps commented 4 weeks ago

@jonatanschroeder - Markdown does not support formatting inside code spans or code blocks; characters inside these elements (thus including styling elements) are displayed literally.
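
To illustrate the point about literal display, here is a minimal sketch of how a Markdown renderer treats the contents of a code span. It is a toy, standard-library-only stand-in for a real parser, and the names `CODE_SPAN` and `render_code_spans` are hypothetical. It only handles single-backtick spans.

```python
import html
import re

# Toy pattern for single-backtick code spans; real Markdown parsers
# handle many more cases (multiple backticks, line breaks, and so on).
CODE_SPAN = re.compile(r"`([^`]+)`")


def render_code_spans(markdown_text: str) -> str:
    """Replace `...` spans with <code>...</code>, escaping the contents.

    The contents are never parsed for emphasis or HTML, which is why
    formatting inside a code span cannot survive a round trip.
    """
    return CODE_SPAN.sub(
        lambda m: "<code>" + html.escape(m.group(1)) + "</code>",
        markdown_text,
    )


# Emphasis markers stay as literal asterisks.
print(render_code_spans("`normal **bold**`"))
# <code>normal **bold**</code>

# Inline HTML inside the span is escaped and shown as text.
print(render_code_spans("`normal <strong>bold</strong>`"))
# <code>normal &lt;strong&gt;bold&lt;/strong&gt;</code>
```

Either way the reader sees the markup characters themselves, not bold text.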

jonatanschroeder commented 4 weeks ago

Indeed. So if the intention is to create Markdown that can be converted back to equivalent HTML, this loses information, because converting it back will not restore that formatting. If that is not the intention, then I may have misunderstood the purpose of the library.

chrispy-snps commented 4 weeks ago

The purpose is to convert HTML to Markdown, but that is an inherently lossy process as Markdown cannot represent everything that HTML can.

There is a "backdoor" in Markdown for such cases where you can intermix HTML and Markdown syntax in a Markdown document, but most of us converting HTML to Markdown are trying to get away from HTML entirely (for simplified text processing, etc.).
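
As a rough illustration of that "backdoor", the sketch below emits a backtick code span when a `<code>` element contains only text and falls back to raw inline HTML when it contains nested tags. This is not how markdownify works; `CodeSpanConverter` and `convert` are hypothetical names, the code uses only the standard library, and it ignores nested `<code>` elements and backticks inside the text.

```python
import html
from html.parser import HTMLParser


class CodeSpanConverter(HTMLParser):
    """Hypothetical helper that only understands <code> and plain text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # converted pieces of the document
        self.in_code = False  # True while inside a <code> element
        self.raw = []         # original markup of the current <code> body
        self.text = []        # text-only view of the current <code> body
        self.nested = False   # True if the <code> body contains tags

    def handle_starttag(self, tag, attrs):
        if tag == "code" and not self.in_code:
            self.in_code = True
            self.raw, self.text, self.nested = [], [], False
        elif self.in_code:
            self.nested = True
            self.raw.append(self.get_starttag_text())
        # Tags outside <code> are dropped in this sketch.

    def handle_endtag(self, tag):
        if not self.in_code:
            return
        if tag == "code":
            if self.nested:
                # Fallback: keep raw HTML so the formatting survives.
                self.out.append("<code>" + "".join(self.raw) + "</code>")
            else:
                self.out.append("`" + "".join(self.text) + "`")
            self.in_code = False
        else:
            self.raw.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_code:
            self.raw.append(html.escape(data, quote=False))
            self.text.append(data)
        else:
            self.out.append(data)


def convert(html_text: str) -> str:
    parser = CodeSpanConverter()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(convert("<code>plain</code>"))
# `plain`
print(convert("<code>normal <strong>bold</strong></code>"))
# <code>normal <strong>bold</strong></code>
```

The trade-off is the one described above: the output keeps the formatting, but it is no longer free of HTML.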

jonatanschroeder commented 4 weeks ago

Noted. I will close the issue then.

chrispy-snps commented 4 weeks ago

@jonatanschroeder - for what it's worth, we also wish Markdown supported formatting inside code spans and blocks. Our HTML makes heavy use of emphasis inside <pre> blocks, and that emphasis is lost when we convert it to Markdown for LLM use.