zombiecalypse commented 2 years ago

Implementation here:

https://github.com/mixmark-io/turndown/blob/4499b5c313d30a3189a58fdd74fc4ed4b2428afd/src/commonmark-rules.js#L209

Reproducing example: turndown("<em>foo<br></em>") == "_foo  \n_"

https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis

A single _ character can close emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.

and

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

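This means that commonmark2html("_foo  \n_") = "_foo<br>_", i.e. the <em> is lost: the closing _ is preceded by a line ending, which counts as Unicode whitespace, so it is not part of a right-flanking delimiter run and cannot close emphasis.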
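The two quoted rules can be checked mechanically. The following is a minimal sketch, not a CommonMark parser: the names isWhitespace, isPunctuation and canCloseUnderscore are made up for this illustration, the left-flanking test is taken from the same section of the spec, and characters are inspected per UTF-16 code unit, which is a simplification.

```javascript
// Unicode whitespace per CommonMark 0.30: category Zs, tab, LF, FF, CR.
// The beginning and end of the line (undefined here) also count as whitespace.
const isWhitespace = (ch) => ch === undefined || /[\p{Zs}\t\n\f\r]/u.test(ch);

// Unicode punctuation per CommonMark 0.30: ASCII punctuation or category P.
const isPunctuation = (ch) =>
  ch !== undefined && /[!-\/:-@\[-`{-~\p{P}]/u.test(ch);

// Returns true if the `_` at `index` in `text` may close emphasis.
function canCloseUnderscore(text, index) {
  // Find the delimiter run (consecutive `_`) that contains `index`.
  let start = index;
  while (start > 0 && text[start - 1] === '_') start--;
  let end = index + 1;
  while (end < text.length && text[end] === '_') end++;

  const before = text[start - 1]; // undefined at the start of the text
  const after = text[end];        // undefined at the end of the text

  const rightFlanking =
    !isWhitespace(before) &&
    (!isPunctuation(before) || isWhitespace(after) || isPunctuation(after));
  const leftFlanking =
    !isWhitespace(after) &&
    (!isPunctuation(after) || isWhitespace(before) || isPunctuation(before));

  return rightFlanking && (!leftFlanking || isPunctuation(after));
}
```

With this sketch, the final _ of "_foo  \n_" is rejected because the character before it is the line ending, while the second _ of "_foo_  \n" is accepted because it directly follows "o".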

The same is true for the other possible span delimiters (*, __, **) and for a leading <br> in a span element.

As far as I can tell only <br> is affected. While foobar and similar abominations do trip up the context-free replacement, they are fortunately not valid HTML.

zombiecalypse commented 2 years ago

Added a pull request that demonstrates this and other corner cases:

https://github.com/mixmark-io/turndown/pull/406

Flashwalker commented 1 year ago

Can we avoid it somehow???

1.

Zero-width space and/or non-breaking space: <a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML with zero visual content? Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/.+?>',
    replacement: function (content) {
        return ''
    }
})
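For reference, Turndown's addRule takes a filter that may be a function of the node instead of a tag name, so a check along these lines can be written without a regexFilter option. This is a minimal sketch: INVISIBLE and emptyLinkRule are names invented here, and the set of characters treated as invisible is an assumption that may need extending.

```javascript
// Text that renders as nothing: whitespace (\s covers NBSP and BOM in
// JavaScript), zero-width space/non-joiner/joiner, and word joiner.
const INVISIBLE = /^[\s\u200B-\u200D\u2060]*$/;

// Rule object in the shape expected by turndownService.addRule(name, rule).
const emptyLinkRule = {
  // Match links whose text is invisible and which do not wrap an image.
  filter: (node) =>
    node.nodeName === 'A' &&
    INVISIBLE.test(node.textContent) &&
    node.querySelector('img') === null,
  // Drop the whole element from the Markdown output.
  replacement: () => ''
};
```

It would be registered with turndownService.addRule('emptyLink', emptyLinkRule). The image check is there because a link that only wraps an image also has empty text content but is not visually empty.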

2.

Line break which breaks Markdown's markup: bold bla-bla-bla with a <br> before its closing tag, followed by text-text-text, produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede a closing tag? Something like:

turndownService.removeAllBefore('<br>', '</*>')

https://github.com/mixmark-io/turndown/issues/423

SARAsBooks commented 1 year ago

As far as I can tell only <br> is affected.

Good to hear that, @zombiecalypse. Maybe this single exception could be handled by a rule that adds the span delimiters (_, *, __, **) before and after the <br>? @Flashwalker, removing <br> is no good because it should be preserved in the Markdown.

My code uses const markdown = convertToMarkdown( article.content.replaceAll(' ', ' ') );, but that is specific to the formatting I encountered in one article:

https://github.com/SARAsBooks/html-to-markdown/blob/04e64d6074bd95903c331d167bb6edc869977986/automationWorkflow.js#L45
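A more general preprocessing step in the same spirit could move a <br> out of the span element instead of removing it, so the line break is preserved but no longer sits next to the delimiter. The following is a rough sketch under stated assumptions: hoistLineBreaks, SPAN_TAGS, TRAILING_BR and LEADING_BR are hypothetical names, only em, i, strong and b are covered, and the HTML is rewritten with regular expressions rather than a DOM, which will not cope with every input.

```javascript
// Inline tags whose Markdown delimiters must not touch a line break.
const SPAN_TAGS = 'em|i|strong|b';

// One or more <br> directly before a closing span tag.
const TRAILING_BR = new RegExp(
  `((?:<br\\s*/?>\\s*)+)(</(?:${SPAN_TAGS})>)`, 'gi');

// One or more <br> directly after an opening span tag.
const LEADING_BR = new RegExp(
  `(<(?:${SPAN_TAGS})(?:\\s[^>]*)?>)((?:\\s*<br\\s*/?>)+)`, 'gi');

// Swap each <br> run with the adjacent span tag until nothing changes,
// so that breaks inside nested spans end up outside all of them.
function hoistLineBreaks(html) {
  let previous;
  do {
    previous = html;
    html = html.replace(TRAILING_BR, '$2$1').replace(LEADING_BR, '$2$1');
  } while (html !== previous);
  return html;
}
```

For example, hoistLineBreaks('<em>foo<br></em>') returns '<em>foo</em><br>', which can then be passed to turndown in place of the original HTML.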

mixmark-io / turndown

Span rules + br can break commonmark standard #405
