mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

Emphasis before dash in hyphenated words generates invalid output according to CommonMark standards #365

Open catacom opened 3 years ago

catacom commented 3 years ago

Emphasis before dash in hyphenated words generates invalid Markdown output according to CommonMark standards.

Example: mean<em>-spirited</em> is converted to mean_\-spirited_, which CommonMark does not render as emphasis (checked with https://spec.commonmark.org/dingus/).

Similar issues occur for strong emphasis.
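To illustrate why the output is not parsed as emphasis, here is a minimal sketch of the CommonMark flanking rules for a single delimiter character. It is not Turndown code: the function names are made up for this example, and it assumes ASCII-only whitespace and punctuation, whereas the spec also covers Unicode.

```javascript
// Sketch of the CommonMark delimiter-run rules for one delimiter character.
// Assumption: ASCII-only whitespace and punctuation classes.
// All names here are hypothetical and not part of Turndown.
const isWhitespace = (ch) => ch === undefined || /\s/.test(ch);
const isPunctuation = (ch) => ch !== undefined && /[!-\/:-@\[-`{-~]/.test(ch);

// Left-flanking: not followed by whitespace, and either not followed by
// punctuation, or preceded by whitespace or punctuation.
function isLeftFlanking(text, i) {
  const before = text[i - 1]; // undefined (line start) counts as whitespace
  const after = text[i + 1];
  if (isWhitespace(after)) return false;
  if (!isPunctuation(after)) return true;
  return isWhitespace(before) || isPunctuation(before);
}

// Right-flanking: the mirror image of the rule above.
function isRightFlanking(text, i) {
  const before = text[i - 1];
  const after = text[i + 1];
  if (isWhitespace(before)) return false;
  if (!isPunctuation(before)) return true;
  return isWhitespace(after) || isPunctuation(after);
}

// `*` can open emphasis when left-flanking. `_` additionally must not be
// right-flanking unless it is preceded by punctuation.
function canOpenEmphasis(text, i) {
  const left = isLeftFlanking(text, i);
  if (text[i] === '*') return left;
  return left && (!isRightFlanking(text, i) || isPunctuation(text[i - 1]));
}

console.log(canOpenEmphasis('mean_\\-spirited_', 4)); // delimiter followed by "\"
console.log(canOpenEmphasis('mean-_spirited_', 5));   // delimiter preceded by "-"
```

Under these rules the first `_` in mean_\-spirited_ is followed by punctuation and preceded by a letter, so it is not left-flanking and cannot open emphasis. The same check fails for `*` in that position, which suggests that switching delimiters alone would not fix this case.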

martincizek commented 3 years ago

This one is difficult to solve, because you have to either

  1. keep it as HTML, i.e. mean<em>-spirited</em>, or
  2. intervene in the content and move the dash outside of the emphasis, i.e. mean-_spirited_

The second option is already implemented for ASCII and Unicode whitespace, where the Unicode whitespace is "more preserved" (#315). The dash could be handled much like Unicode whitespace (where mean<em>&nbsp;spirited</em> becomes mean\u00A0_spirited_), but that needs careful analysis of possible side effects. Is this what you suggest?
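As a rough illustration of the second option, the sketch below moves leading hyphens out of the emphasized content before the delimiters are added. The helper name `emphasizeWithDashOutside` is hypothetical and this is not how Turndown is implemented; it also ignores trailing dashes and the escaping that a dash would need at the start of a line, where it could be read as a list marker.

```javascript
// Hypothetical helper, not part of Turndown: splits leading hyphens off the
// emphasized content so the opening delimiter is followed by a word character.
function emphasizeWithDashOutside(content, delimiter = '_') {
  const match = /^(-+)([\s\S]*)$/.exec(content);
  // No leading hyphen: wrap the content unchanged.
  if (!match) return delimiter + content + delimiter;
  const [, dashes, rest] = match;
  // Only hyphens: there is nothing left to emphasize.
  if (rest === '') return dashes;
  return dashes + delimiter + rest + delimiter;
}

// The surrounding text is concatenated by the caller.
console.log('mean' + emphasizeWithDashOutside('-spirited'));
console.log('mean' + emphasizeWithDashOutside('-spirited', '**'));
```

The trade-off is the one raised above: the dash is no longer emphasized, so the output does not round-trip to exactly the same HTML.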

P.S. I originally thought this was related to intraword emphasis, where using * and ** instead of _ and __ improves the situation.