matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License

Formatting inside code spans #141

Closed: jonatanschroeder closed this issue 2 months ago

jonatanschroeder commented 3 months ago

Consider the following construct:

<code>normal <strong>bold</strong></code>

This is valid HTML, and it renders "bold" in bold. markdownify drops the strong tag, though:

>>> from markdownify import markdownify as md
>>> md('<code>normal <strong>bold</strong></code>')
'`normal bold`'

chrispy-snps commented 2 months ago

@jonatanschroeder - Markdown does not support formatting inside code spans or code blocks; characters inside these elements (thus including styling elements) are displayed literally.

jonatanschroeder commented 2 months ago

Indeed. So if the intention is to create Markdown that can be converted back to equivalent HTML, this loses information, because converting it back will not restore that formatting. If that is not the intention, then I may have misunderstood the purpose of the library.

chrispy-snps commented 2 months ago

The purpose is to convert HTML to Markdown, but that is an inherently lossy process as Markdown cannot represent everything that HTML can.

There is a "backdoor" in Markdown for such cases where you can intermix HTML and Markdown syntax in a Markdown document, but most of us converting HTML to Markdown are trying to get away from HTML entirely (for simplified text processing, etc.).
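
To illustrate that "backdoor", here is a minimal sketch of a converter that emits a backtick code span for plain <code> elements but keeps the element as inline HTML when it contains formatting. It uses only the Python standard library and is not how markdownify behaves; `CodeSpanConverter`, `convert`, and `FORMATTING_TAGS` are hypothetical names chosen for this example, and tags outside <code> are simply dropped to keep it short.

```python
from html import escape
from html.parser import HTMLParser

# Inline tags counted as "formatting" in this sketch (an assumption,
# not a markdownify option).
FORMATTING_TAGS = {"strong", "b", "em", "i"}


class CodeSpanConverter(HTMLParser):
    """Hypothetical helper: converts <code> elements, keeps other text,
    and drops tags that appear outside <code>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []                # converted output pieces
        self.in_code = False         # True while inside a <code> element
        self.raw = []                # raw HTML of the current <code> contents
        self.text = []               # plain text of the current <code> contents
        self.has_formatting = False  # saw a formatting tag inside <code>

    def handle_starttag(self, tag, attrs):
        if tag == "code" and not self.in_code:
            self.in_code = True
            self.raw, self.text, self.has_formatting = [], [], False
        elif self.in_code:
            # Keep the tag exactly as written in the source.
            self.raw.append(self.get_starttag_text())
            if tag in FORMATTING_TAGS:
                self.has_formatting = True

    def handle_endtag(self, tag):
        if tag == "code" and self.in_code:
            self.in_code = False
            if self.has_formatting:
                # Fallback: inline HTML, so the formatting survives.
                self.out.append("<code>" + "".join(self.raw) + "</code>")
            else:
                # Plain case: an ordinary Markdown code span.
                self.out.append("`" + "".join(self.text) + "`")
        elif self.in_code:
            self.raw.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_code:
            self.raw.append(escape(data, quote=False))
            self.text.append(data)
        else:
            self.out.append(data)


def convert(html):
    parser = CodeSpanConverter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(convert("<code>normal <strong>bold</strong></code>"))
print(convert("Use <code>x = 1</code> here"))
```

The trade-off is the one described above: the output is no longer pure Markdown, so anything downstream that expects plain text still has to deal with HTML tags. The sketch also ignores nested <code> elements and backticks inside code text.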

jonatanschroeder commented 2 months ago

Noted. I will close the issue then.

chrispy-snps commented 2 months ago

@jonatanschroeder - for what it's worth, we also wish Markdown supported formatting inside code spans and blocks. Our HTML makes heavy use of emphasis inside <pre> blocks. When we convert it to Markdown for LLM use, the emphasis is lost.