mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.61k stars, 870 forks

Span rules + br can break commonmark standard #405

Open zombiecalypse opened 2 years ago

zombiecalypse commented 2 years ago

Implementation here:

https://github.com/mixmark-io/turndown/blob/4499b5c313d30a3189a58fdd74fc4ed4b2428afd/src/commonmark-rules.js#L209

Reproducing example: turndown("<em>foo<br/></em>") == "_foo \n_"

https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis

A single _ character can close emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.

and

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This means that commonmark2html("_foo \n_") = "<p>_foo<br/>_</p>", i.e. the <em> is lost.

The same is true for the other possible span delimiters (*, __, **) and on a leading <br/> in a span element.
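The right-flanking condition quoted above can be checked mechanically. The following is only a minimal sketch of that one definition, not a CommonMark parser: `isRightFlanking`, `isWhitespace` and `isPunctuation` are hypothetical helper names, JavaScript's `\s` is used as an approximation of "Unicode whitespace", and delimiter positions are passed in by hand.

```javascript
// Approximation of the spec's character classes (assumption: \s is close
// enough to "Unicode whitespace" for this illustration).
// A missing character (start or end of input) counts as whitespace.
const isWhitespace = (ch) => ch === undefined || /\s/.test(ch);
// ASCII punctuation plus the Unicode P* general categories.
const isPunctuation = (ch) =>
  ch !== undefined && /[!-\/:-@\[-`{-~]|\p{P}/u.test(ch);

// text[start..end) is the delimiter run, e.g. the closing "_" or "**".
function isRightFlanking(text, start, end) {
  const before = text[start - 1];
  const after = text[end];
  if (isWhitespace(before)) return false;  // fails condition (1)
  if (!isPunctuation(before)) return true; // condition (2a)
  return isWhitespace(after) || isPunctuation(after); // condition (2b)
}

isRightFlanking("_foo_", 4, 5);     // true: preceded by "o"
isRightFlanking("_foo \n_", 6, 7);  // false: preceded by a line ending
```

Because the closing delimiter in the Turndown output sits directly after a line ending, it fails condition (1) and cannot close the emphasis, whichever delimiter is used.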

As far as I can tell, only <br/> is affected. While <em><p>foo<p></em>bar and similar abominations do trip up the context-free replacement, they are fortunately not valid HTML.

zombiecalypse commented 2 years ago

Added a pull request that demonstrates this and other corner cases:

https://github.com/mixmark-io/turndown/pull/406

Flashwalker commented 1 year ago

Can we avoid it somehow???

1. Zero-width space and/or non-breaking space: <a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[​​](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) html with zero visual content? Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/.+?>',
    replacement: function (content) {
        return ''
    }
})
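There is no `regexFilter` option as far as I know, but a rule's `filter` may be a function that receives the DOM node, so something close to this can probably be done with the existing `addRule` API. A rough sketch, assuming the goal is only to drop links with no visible text; `emptyLinkRule` and `hasNoVisibleText` are made-up names. Note that `\s` in JavaScript does not match the zero-width space (U+200B), so it is listed explicitly.

```javascript
// Ordinary whitespace (includes NBSP) plus zero-width characters.
const INVISIBLE = /[\s\u200B\u200C\u200D\u2060\uFEFF]/g;

function hasNoVisibleText(text) {
  return text.replace(INVISIBLE, '') === '';
}

// Hypothetical rule: matches <a> elements whose text is entirely invisible
// and that do not wrap an image, and replaces them with nothing.
const emptyLinkRule = {
  filter: function (node) {
    if (node.nodeName !== 'A') return false;
    const wrapsImage =
      typeof node.querySelector === 'function' &&
      node.querySelector('img') !== null;
    return !wrapsImage && hasNoVisibleText(node.textContent);
  },
  replacement: function () {
    return '';
  }
};
```

With a `turndownService` instance as in the snippet above, this would be registered with `turndownService.addRule('emptyLink', emptyLinkRule)`. It is untested against the full conversion pipeline, so treat it as a starting point.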

2. A line break that breaks Markdown's markup: <strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede the closing tag? Something like:

turndownService.removeAllBefore('<br>', '</*>')

https://github.com/mixmark-io/turndown/issues/423

SARAsBooks commented 1 year ago

As far as I can tell only <br/> is affected.

Good to hear that, @zombiecalypse. Maybe this single exception could be handled in the rules by adding the span delimiters (_, *, __, **) before and after <br> or <br/>? @Flashwalker, removing <br> is no good because it should be preserved in the markdown.

My code uses const markdown = convertToMarkdown( article.content.replaceAll('<br></em>', '</em><br>') );, but that is specific to the formatting I encountered in one article:

https://github.com/SARAsBooks/html-to-markdown/blob/04e64d6074bd95903c331d167bb6edc869977986/automationWorkflow.js#L45
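The same idea could be generalized a little: move any <br> that sits directly inside the opening or closing tag of an emphasis element to just outside it before handing the HTML to the converter. This is only a sketch; `hoistBreaks` is a hypothetical helper, it covers just em, i, strong and b, and it relies on regular expressions, so it assumes reasonably well-formed markup (no ">" inside attribute values, for instance).

```javascript
// One or more <br> (with surrounding whitespace) right before a closing tag.
const TRAILING = /((?:\s*<br\s*\/?>\s*)+)(<\/(?:em|i|strong|b)>)/gi;
// One or more <br> (with surrounding whitespace) right after an opening tag.
const LEADING = /(<(?:em|i|strong|b)(?:\s[^>]*)?>)((?:\s*<br\s*\/?>\s*)+)/gi;

function hoistBreaks(html) {
  let previous;
  // Repeat until nothing changes so that nested spans are handled too.
  do {
    previous = html;
    html = html.replace(TRAILING, '$2$1').replace(LEADING, '$2$1');
  } while (html !== previous);
  return html;
}

hoistBreaks('<em>foo<br/></em>');  // '<em>foo</em><br/>'
hoistBreaks('<em><br>foo</em>');   // '<br><em>foo</em>'
```

The line break is kept rather than removed, so it still ends up in the markdown, just outside the emphasis delimiters. A DOM-based pre-processing step would be more robust than string replacement if the input is messy.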