mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

Filter or remove rules to filter/remove by regexp/wildcard #423

Open Flashwalker opened 1 year ago

Flashwalker commented 1 year ago

Can we have rules that filter or remove content by regexp or wildcard?

E.g.:

1.

Zero-width space and/or non-breaking space: <a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[​​](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML elements with zero visual content? Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/[^<>]+?>',
    replacement: function (content) {
        return ''
    }
})
List of spaces for reference:

Code point  Character name
\u0020 space
\u00A0 no-break space
\u1680 Ogham space mark
\u180E Mongolian vowel separator
\u2000 en quad
\u2001 em quad
\u2002 en space (nut)
\u2003 em space (mutton)
\u2004 three-per-em space (thick space)
\u2005 four-per-em space (mid space)
\u2006 six-per-em space
\u2007 figure space
\u2008 punctuation space
\u2009 thin space
\u200A hair space
\u200B zero width space
\u202F narrow no-break space
\u205F medium mathematical space
\u3000 ideographic space
\uFEFF zero width no-break space
\uFFFC object replacement character
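
Something close to item 1 may already be possible without a new regexFilter option, because addRule accepts a function as its filter. Below is a minimal sketch under that assumption. The names INVISIBLE_ONLY and invisibleAnchorRule are made up for illustration and are not part of Turndown; the character class is copied from the list above.

```javascript
// Code points taken from the list above; treated here as "invisible" content.
const INVISIBLE_ONLY = /^[\u0020\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\uFFFC]*$/

// Hypothetical rule object (the name is not part of Turndown).
const invisibleAnchorRule = {
  // A function filter receives the DOM node that is being converted.
  filter: function (node) {
    return (
      node.nodeName === 'A' &&
      INVISIBLE_ONLY.test(node.textContent) &&
      node.querySelector('img') === null // keep links that wrap only an image
    )
  },
  // Returning an empty string drops the element from the Markdown output.
  replacement: function () {
    return ''
  }
}

// Registration, assuming `turndownService` is an existing TurndownService instance:
// turndownService.addRule('invisibleAnchor', invisibleAnchorRule)
```

Since this works on the parsed DOM rather than on the HTML string, it should not be confused by attribute values that happen to contain angle brackets. It only covers anchors; other inline tags would need their own nodeName check.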

2.

A line break that breaks Markdown's markup: <strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede a closing tag? Something like:

turndownService.removeAllBefore('<br>', '</*>')
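
removeAllBefore is only a wish, not an existing method. A similar effect might be reachable with a function filter on addRule that drops a <br> when it is the last child of an inline element. This is a rough sketch, not tested against every DOM shape; INLINE_PARENTS and trailingBreakRule are hypothetical names, and the set of parent tags is an assumption.

```javascript
// Inline elements whose Markdown delimiters break when a newline sits just inside them.
// This set is an assumption for illustration; extend it as needed.
const INLINE_PARENTS = new Set(['STRONG', 'B', 'EM', 'I', 'A', 'DEL', 'S'])

// Hypothetical rule object: drops a <br> that is the last child of an inline element.
const trailingBreakRule = {
  filter: function (node) {
    return (
      node.nodeName === 'BR' &&
      node.nextSibling === null && // nothing follows the <br> inside its parent
      node.parentNode !== null &&
      INLINE_PARENTS.has(node.parentNode.nodeName)
    )
  },
  // Returning an empty string removes the line break from the output.
  replacement: function () {
    return ''
  }
}

// Registration, assuming `turndownService` is an existing TurndownService instance:
// turndownService.addRule('trailingBreak', trailingBreakRule)
```

This drops the break instead of moving it after the closing tag, and it will not match if a whitespace text node follows the <br> inside the parent.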

Here are some regex examples:

Remove the anchor with zero-width spaces (you can't see them until you paste it in dev console):

selectedHTML='<i>bla</i><b><a href="https://bla-bla-bla">​​​​​​​</a>text-text-text</b><i>bla</i>'
selectedHTML.replace(/<[^<>]+?>[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\u0020\uFFFC]+<\/[^<>]+?>/gm, '')

Remove the line break that precedes a closing tag:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2')

Swap the line break with the closing tag that follows it:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/((<br ?\/?>)+)(<\/[^<>]+?>)/gi, '$3$1')

It would be nice if the regex filter skipped the content of code and pre tags.
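
Until something like that exists, the skipping could be done as a pre-processing step on the HTML string before it is handed to Turndown. Here is a minimal sketch; replaceOutsideCode is a hypothetical helper, not a Turndown function, and it assumes pre and code elements are not nested inside elements of the same name.

```javascript
// Hypothetical helper: applies `pattern` -> `replacement` only to the parts of
// `html` that lie outside <pre>…</pre> and <code>…</code> elements.
// `pattern` should carry the `g` flag so every match in a segment is replaced.
function replaceOutsideCode (html, pattern, replacement) {
  // Whole <pre> or <code> element, matched lazily up to its own closing tag.
  const PROTECTED = /<(pre|code)\b[^>]*>[\s\S]*?<\/\1\s*>/gi
  let result = ''
  let last = 0
  for (const match of html.matchAll(PROTECTED)) {
    // Rewrite the text before the protected element, then copy the element untouched.
    result += html.slice(last, match.index).replace(pattern, replacement)
    result += match[0]
    last = match.index + match[0].length
  }
  // Rewrite whatever follows the last protected element.
  return result + html.slice(last).replace(pattern, replacement)
}

// Usage with the "remove the line break that precedes a closing tag" regex:
const sampleHTML = '<strong>bla<br></strong><pre>keep<br></pre>'
const cleanedHTML = replaceOutsideCode(sampleHTML, /(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2')
// cleanedHTML is '<strong>bla</strong><pre>keep<br></pre>'
```

A match that would span the boundary of a protected element is not replaced, which seems like the safer behavior here.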

P.S.
And also:

// Drop anchor tags that contain only dots and commas
selectedHTML = '<a href="#">,</a>'
selectedHTML.replace(/<a [^<>]+?>[.,]+<\/a>/gim, '')

And

// Drop emoji images, keep emoji unicode (from alt attr)
selectedHTML = '<img src="img-apple-64/1f914.png" class="emoji" alt="🤔">'
selectedHTML.replace(/<img [^<>]+?alt=['"]([\p{Emoji}\u200d]+)['"][^<>]*?\/?>/gimu, '$1')
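
The emoji case might also be expressible as a rule on the DOM node instead of a regex over the HTML string. This is a sketch assuming addRule's function filter and the (content, node) arguments of replacement; EMOJI_ALT and emojiImageRule are hypothetical names, and the added \uFE0F (variation selector) is my assumption, it is not in the regex above.

```javascript
// Alt text made only of emoji code points, zero-width joiners and variation selectors.
// The `u` flag is required for \p{…} property escapes.
const EMOJI_ALT = /^[\p{Emoji}\u200d\ufe0f]+$/u

// Hypothetical rule object: replaces an emoji <img> with the emoji from its alt attribute.
const emojiImageRule = {
  filter: function (node) {
    // getAttribute returns null when alt is missing, so fall back to ''.
    return node.nodeName === 'IMG' && EMOJI_ALT.test(node.getAttribute('alt') || '')
  },
  replacement: function (content, node) {
    return node.getAttribute('alt')
  }
}

// Registration, assuming `turndownService` is an existing TurndownService instance:
// turndownService.addRule('emojiImage', emojiImageRule)
```

One caveat: \p{Emoji} also matches the ASCII digits, # and *, so an image whose alt is just "1" would be treated as an emoji too. Checking for the emoji class as well would narrow that down.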