mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

keep() and remove() do not work for all HTML tags #455

Open trymeouteh opened 5 months ago

trymeouteh commented 5 months ago

Some tags, such as <b> and <i>, are not affected by the keep() and remove() methods. They are still converted to Markdown even when Turndown has been instructed to keep or remove them.

Here is an example to reproduce this...

Turndown v7.1.3

<script src="node_modules/turndown/dist/turndown.js"></script>

<script>
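    // Default behavior: <b> and <i> are converted to Markdown.
    const myTurndownA = new TurndownService();
    console.log(myTurndownA.turndown('<b>Hello <i>World</i></b>'));

    // Expected: tags kept as HTML. Actual: still converted to Markdown.
    myTurndownA.keep(['b', 'i']);
    console.log(myTurndownA.turndown('<b>Hello <i>World</i></b>'));

    const myTurndownB = new TurndownService();
    console.log(myTurndownB.turndown('<b>Hello <i>World</i></b>'));

    // Expected: tags and their content removed. Actual: still converted to Markdown.
    myTurndownB.remove(['b', 'i']);
    console.log(myTurndownB.turndown('<b>Hello <i>World</i></b>'));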
</script>

hwiorn commented 3 months ago

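This happens because the current implementation checks the predefined replacement rules before the keep rules and remove rules, so a tag that already has a predefined rule never reaches them. The fix is simply to change the order of the rules, but I can't say whether that would be backward compatible.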
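
To illustrate the ordering problem, here is a small self-contained model of first-match rule lookup. It is a simplified sketch, not Turndown's actual source: `findRule`, `builtInRules`, and `keepRules` are hypothetical names, and each rule is reduced to a tag list plus a label.

```javascript
// Simplified model of rule lookup (hypothetical; not Turndown's real code).
// Each rule set is an array of { tags, name } objects; the first match wins.
const builtInRules = [
  { tags: ['b', 'strong'], name: 'strong (converted to Markdown)' },
  { tags: ['i', 'em'], name: 'emphasis (converted to Markdown)' },
];
const keepRules = [{ tags: ['b', 'i'], name: 'keep (left as HTML)' }];

// Walk the rule sets in the given order and return the first matching rule.
function findRule(tag, ruleSets) {
  for (const rules of ruleSets) {
    const match = rules.find((rule) => rule.tags.includes(tag));
    if (match) return match.name;
  }
  return 'default';
}

// Built-in rules checked first: the keep rule for <b> is never reached.
const currentOrder = findRule('b', [builtInRules, keepRules]);
// Keep rules checked first: <b> is kept, which changes behavior for existing users.
const swappedOrder = findRule('b', [keepRules, builtInRules]);

console.log(currentOrder);
console.log(swappedOrder);
```

In this model the only difference between the two results is the order of the arrays passed to `findRule`, which is why a reordering fix would also change the output for anyone relying on the current behavior.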

The same issue happens when a user defines custom replacement rules for strikethrough (the del tag) or underline (the ins tag).
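
One possible workaround, assuming the rule precedence listed in the Turndown README (rules added with addRule() are checked before the built-in CommonMark rules), is to register an added rule that returns the element's own HTML. This is only a sketch and not a fix proposed in this thread: `keepAsHtml` is a hypothetical helper name, and it relies only on `addRule(key, { filter, replacement })` and the node's `outerHTML`.

```javascript
// Hypothetical helper: register an added rule that emits the original HTML
// for the given tags. Added rules are consulted before the built-in ones,
// so this should take effect where keep() currently does not.
function keepAsHtml(service, tags) {
  service.addRule('keepAsHtml:' + tags.join(','), {
    filter: tags, // Turndown filters accept an array of lowercase tag names
    replacement: (content, node) => node.outerHTML,
  });
  return service;
}

// Usage with a real instance (assumes TurndownService is loaded as in the report):
//   const service = keepAsHtml(new TurndownService(), ['b', 'i']);
//   console.log(service.turndown('<b>Hello <i>World</i></b>'));
```

For the remove() case, the same approach could be used with a replacement that returns an empty string instead of `node.outerHTML`.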