mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.84k stars, 879 forks

Keep, does not keep <kbd> #363

Closed Offerel closed 3 years ago

Offerel commented 3 years ago

I'm using the following code:

let options = {
    headingStyle: 'atx',
    hr: '-',
    bulletListMarker: '-',
    codeBlockStyle: 'fenced',
    fence: '```',
    emDelimiter: '*',
    strongDelimiter: '**',
    linkStyle: 'inlined',
    linkReferenceStyle: 'full',
    collapseMultipleWhitespaces: true,
    preformattedCode: true,
};
let turndownService = new window.TurndownService(options);
turndownService.keep(['kbd', 'ins']);
console.log(turndownService.turndown(pastedHTML));

But from:

<kbd>STRG</kbd> <code>test</code></p>

I got:

`test`

Is there something I'm doing wrong?

martincizek commented 3 years ago

Can't reproduce it. Can you double-check the actual contents of your pastedHTML before it is passed to turndown?

With current Turndown version from npm and Node.js:

const TurndownService = require('turndown');
let options = {
  headingStyle: 'atx',
  hr: '-',
  bulletListMarker: '-',
  codeBlockStyle: 'fenced',
  fence: '```',
  emDelimiter: '*',
  strongDelimiter: '**',
  linkStyle: 'inlined',
  linkReferenceStyle: 'full',
  collapseMultipleWhitespaces: true,
  preformattedCode: true,
};
let turndownService = new TurndownService(options);
turndownService.keep(['kbd', 'ins']);

const html = '<kbd>STRG</kbd> <code>test</code>';
const htmlUnbalancedPTag = '<kbd>STRG</kbd> <code>test</code></p>';

console.log(turndownService.turndown(html));
// <kbd>STRG</kbd> `test`
console.log(turndownService.turndown(htmlUnbalancedPTag));
// <kbd>STRG</kbd> `test`

Offerel commented 3 years ago

Strange. I retested this without changing the code, at least not the Turndown part, and this time it works as expected: <kbd> is kept.

One additional question: it seems logical that when I keep <kbd>, its inline CSS styling information is kept as well. Is there a way to clean the <kbd> tag automatically? Let me first explain what I'm trying to do. I select some text on a web page that I don't control, so I don't know what style information it contains. I then copy the selection to the clipboard with Ctrl+C (STRG+C), go to my textarea or editor, and paste. I have bound Turndown to the textarea's "onPaste" event and pass it clipboardData.getData('text/html'). I want to keep the kbd tag but strip all style information. Is there a way to do this with Turndown, or must I sanitize the tag myself?
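
For readers following along, the paste flow described above could be sketched roughly as follows. This is only an illustration under assumptions: makePasteHandler is a hypothetical helper name, and convert stands for whatever HTML-to-Markdown function is in use (for example, a function that calls turndownService.turndown).

```javascript
// Hypothetical helper: builds a paste handler around any HTML-to-Markdown
// function. `convert` is assumed to take an HTML string and return Markdown.
function makePasteHandler(convert) {
  return function onPaste(event) {
    const html = event.clipboardData.getData('text/html');
    // No HTML flavour on the clipboard: leave the browser's default paste alone.
    if (!html) return;
    event.preventDefault();

    const target = event.target;
    const markdown = convert(html);
    const start = target.selectionStart;
    const end = target.selectionEnd;

    // Replace the current selection (or insert at the caret) with the Markdown.
    target.value = target.value.slice(0, start) + markdown + target.value.slice(end);
    target.selectionStart = target.selectionEnd = start + markdown.length;
  };
}
```

In a browser this might be wired up with textarea.addEventListener('paste', makePasteHandler(function (html) { return turndownService.turndown(html); })), where textarea is the element being pasted into.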

Offerel commented 3 years ago

After trying the library for a bit, I wonder whether my approach to cleaning the kbd tag is the right one:

var options = {
    headingStyle: 'atx',
    hr: '-',
    bulletListMarker: '-',
    codeBlockStyle: 'fenced',
    fence: '```',
    emDelimiter: '*',
    strongDelimiter: '**',
    linkStyle: 'inlined',
    linkReferenceStyle: 'full',
    collapseMultipleWhitespaces: true,
    preformattedCode: true,
};

var turndownService = new window.TurndownService(options);
turndownService.keep(['kbd', 'ins']);
turndownService.addRule('kbd',{
    filter:['kbd'],
    replacement: function(content) {
        return '<kbd>' + content + '</kbd>';
    }
});

// getElementById (lowercase "d"); value is a property, not a method
document.getElementById('mytextarea').value = turndownService.turndown(pastedHTML);

Offerel commented 3 years ago

BTW, it seems I have found the main reason the kbd tag was missing from the output. If I copy from a web page in Firefox (v85.x), the clipboardData contains no kbd tag. If I do the same in Chromium (v88.x), the kbd tag is there. I have no idea why they differ, but it seems to have nothing to do with your library. Maybe it is another of those obscure Firefox issues from recent months. I'm starting to hate Firefox a little bit.
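
One way to check this kind of browser difference is to look at which tags actually arrive on the clipboard before the HTML reaches Turndown. The sketch below is a quick diagnostic only; listTags is a hypothetical helper, and the regex scan is not a real HTML parser.

```javascript
// Hypothetical helper: returns the distinct tag names found in an HTML
// string, lower-cased and sorted. Only opening tags are scanned.
function listTags(html) {
  const names = new Set();
  const openTag = /<([a-zA-Z][a-zA-Z0-9-]*)/g;
  let match;
  while ((match = openTag.exec(html)) !== null) {
    names.add(match[1].toLowerCase());
  }
  return Array.from(names).sort();
}

console.log(listTags('<kbd style="color: red">STRG</kbd> <code>test</code></p>'));
// [ 'code', 'kbd' ]
```

Logging listTags(event.clipboardData.getData('text/html')) inside the paste handler in both browsers would show whether kbd is already missing before any conversion happens.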

martincizek commented 3 years ago

After trying the library for a bit, I wonder whether my approach to cleaning the kbd tag is the right one:

Yes, that's correct. And you probably don't want to keep() the kbd tag when you have a rule for it.

The "feature" of the current keep is that it keeps the whole subtree, whereas a "shallow" keep might better match some use cases. That's definitely worth investigating as a Turndown feature, but it's tricky to design it meaningfully and universally at the same time. For example, when I want a "shallow keep" of a table that cannot be converted to MD, I want to shallow-copy not only the table tag but also the nested table-related tags, but not automatically the tags of another table nested inside it, should there be one. :)
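
Until something like that exists, a simple "shallow keep" for inline tags can be approximated with an ordinary rule. The sketch below is an assumption-laden illustration, not a Turndown feature: shallowKeepRule is a hypothetical helper that returns a rule object (filter plus replacement) which re-emits the tag without attributes while the children go through the normal conversion. It does not address the nested-table scoping problem described above.

```javascript
// Hypothetical helper, not part of Turndown: builds a rule that keeps the
// listed tags but drops their attributes (style, class, ...).
function shallowKeepRule(tagNames) {
  return {
    filter: tagNames,
    replacement: function (content, node) {
      // nodeName is upper-case for elements in HTML documents.
      const name = node.nodeName.toLowerCase();
      return '<' + name + '>' + content + '</' + name + '>';
    }
  };
}

// Usage, assuming a configured turndownService as earlier in the thread:
// turndownService.addRule('shallowKeep', shallowKeepRule(['kbd', 'ins']));
```

Tags handled this way should not also be passed to keep(), for the reason given above.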