Open zombiecalypse opened 2 years ago
Added a pull request that demonstrates this and other corner cases:
Can we avoid it somehow???
Zero width space and/or Non-breaking space:
<a href="https://bla-bla-bla">​​</a>text-text-text
produces:
[](https://bla-bla-bla)text-text-text
Is there any way to filter out (remove) html with zero visual content? Something like:
turndownService.addRule('al_spaces', {
  regexFilter: '<[^<>]+?>[[:space:]]<\/.+?>',
  replacement: function (content) {
    return ''
  }
})
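A regex-based filter option like the one proposed above is not something Turndown documents, but addRule does accept a function as filter. The following is a minimal sketch under that assumption: isInvisibleLink is a hypothetical helper name, and the set of characters treated as invisible is a guess at what matters here rather than an exhaustive list.

```javascript
// Characters that render as nothing: JS \s already covers U+00A0 (no-break
// space); zero-width characters such as U+200B have to be listed explicitly.
const INVISIBLE = /[\s\u200B-\u200D\u2060\uFEFF]/g;

// Works on anything exposing nodeName and textContent, e.g. a DOM node.
function isInvisibleLink(node) {
  if (node.nodeName !== 'A') return false;
  // A link that wraps an image has empty text but is still visible.
  const hasMedia = typeof node.querySelector === 'function' &&
    node.querySelector('img, svg, picture') !== null;
  return !hasMedia && node.textContent.replace(INVISIBLE, '') === '';
}

// Hypothetical wiring, assuming a TurndownService instance named
// turndownService already exists:
//
// turndownService.addRule('invisibleLinks', {
//   filter: isInvisibleLink,
//   replacement: function () { return '' }
// })
```

With such a rule in place, the empty link from the example should be dropped while the text that follows it is kept.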
Line break which breaks markdown's markup:
<strong>bla-bla-bla<br></strong> <br>text-text-text
produces:
**bla-bla-bla
**
text-text-text
Is there any way to filter out (remove) all line breaks that precede the closing tag? Something like:
turndownService.removeAllBefore('<br>', '</*>')
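No removeAllBefore method is known to exist in Turndown. One alternative that keeps the line break is to pre-process the HTML string and move a <br> that sits directly inside the edge of an inline element to just outside it. This is a rough sketch, not a parser: hoistLineBreaks and the tag list are assumptions made for illustration, and regex matching will miss unusual markup.

```javascript
// Inline elements whose Markdown delimiters break when a <br> touches them.
const INLINE = 'em|strong|i|b';

// One or more <br> (with optional whitespace) right before a closing tag.
const TRAILING = new RegExp(`((?:<br\\s*/?>\\s*)+)(</(?:${INLINE})\\s*>)`, 'gi');
// One or more <br> right after an opening tag.
const LEADING = new RegExp(`(<(?:${INLINE})\\b[^>]*>)((?:\\s*<br\\s*/?>)+)`, 'gi');

function hoistLineBreaks(html) {
  let previous;
  do {
    previous = html;
    // Swap the two captured groups so the <br> ends up outside the element.
    html = html.replace(TRAILING, '$2$1').replace(LEADING, '$2$1');
  } while (html !== previous); // repeat so nested elements are handled too
  return html;
}

console.log(hoistLineBreaks('<strong>bla-bla-bla<br></strong> <br>text-text-text'));
// prints: <strong>bla-bla-bla</strong><br> <br>text-text-text
```

The cleaned string can then be passed to the converter as usual, so the line break survives and the closing delimiter stays attached to the text.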
As far as I can tell only <br/> is affected.
Good to hear that, @zombiecalypse. Maybe this single exception could be handled by the rules, by adding span delimiters (_, *, __, **) before and after <br> or <br/>?
@Flashwalker, removing <br> is no good because it should be preserved in the markdown.
My code uses
const markdown = convertToMarkdown(article.content.replaceAll('<br></em>', '</em><br>'));
but that is specific to the formatting I encountered in one article:
Implementation here:
https://github.com/mixmark-io/turndown/blob/4499b5c313d30a3189a58fdd74fc4ed4b2428afd/src/commonmark-rules.js#L209
Reproducing example:
turndown("<em>foo<br/></em>") == "_foo  \n_"
https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis
and
This means that commonmark2html("_foo  \n_") = "<p>_foo<br/>_</p>", i.e. the <em> is lost.
The same is true for the other possible span delimiters (*, __, **) and for a leading <br/> in a span element.
As far as I can tell only <br/> is affected. While <em><p>foo<p></em>bar and similar abominations do trip up the context-free replacement, they are fortunately not valid HTML.
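The lost emphasis follows from CommonMark's flanking rules: a delimiter run can only close emphasis if it is right-flanking, and a run preceded by whitespace (including a line ending) is not. The sketch below is a simplified check written for illustration. The function names are hypothetical, and it works on UTF-16 code units and JavaScript's \s rather than the spec's exact Unicode whitespace definition.

```javascript
// The start and end of the text count as whitespace for flanking purposes.
const isWhitespace = (ch) => ch === undefined || /\s/.test(ch);
// ASCII punctuation plus Unicode general category P, as in CommonMark 0.30.
const isPunctuation = (ch) =>
  ch !== undefined && /[!-\/:-@\[-`{-~\p{P}]/u.test(ch);

// `start` and `end` delimit the run of `_` or `*` characters in `text`.
function isRightFlanking(text, start, end) {
  const before = text[start - 1];
  const after = text[end];
  if (isWhitespace(before)) return false;
  return !isPunctuation(before) || isWhitespace(after) || isPunctuation(after);
}

const broken = '_foo  \n_'; // closing "_" follows the line ending
const fixed = '_foo_  \n';  // closing "_" follows "o"

console.log(isRightFlanking(broken, 7, 8)); // prints: false
console.log(isRightFlanking(fixed, 4, 5));  // prints: true
```

In the first string the final underscore cannot close the emphasis, so a CommonMark renderer leaves both underscores as literal text.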