mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

(addRule) Filter for tags within tags #392

Open NL33 opened 3 years ago

NL33 commented 3 years ago

Thanks for the great work on this package.

I would like to apply a rule to certain tags only if they are nested within others. For example, I would like to apply a rule to <p> tags, but only those inside <td> tags.

Here's the rule I have now:

var TurndownService = require('turndown')
var turndownService = new TurndownService()
turndownService.addRule('ruleName', {  
     filter: 'td',
     replacement: function (content) {
          return '<td>' + content + '</td>'
     }
})
...
turndownService.turndown(content)

How would I apply this rule to <p> tags within <td> tags?

I have tried the following, but it has not worked (the rule is not applied):

turndownService.addRule('newRuleName', {  
     filter: 'td p',
     replacement: function (content) {
          return '<td>' + content + '</td>'
     }
})

martincizek commented 3 years ago

This will be easy as soon as we introduce CommonMark contexts, which is a planned major change.

If your intention is to match only the HTML context, then it's still quite easy even with the current version. A slightly modified version could also take into account which rule gets applied (e.g. an HTML table might or might not end up as a GFM table), but that currently requires some more changes to Turndown.

The HTML-only context matching can be something like:

function inHtmlContext(node, selector) {
  let currentNode = node;
  // start at the closest element
  while (currentNode != null && currentNode.nodeType !== 1) {
    currentNode = currentNode.parentElement || currentNode.parentNode;
  }
  return (
    currentNode !== null
    && currentNode.nodeType === 1
    && currentNode.closest(selector) !== null
  );
}

// ...

turndownService.addRule('newRuleName', {  
  filter: function (node) {
    return node.nodeName === 'P' && inHtmlContext(node, 'td');
  },
  replacement: function (content) {
    return '<td>' + content + '</td>'
  }
})
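
If only the tag name of an ancestor matters, a variant that walks `parentNode` instead of calling `closest` is possible. This is a minimal sketch and not part of Turndown: the names `hasAncestorTag` and `isParagraphInCell` are hypothetical, and it assumes the nodes passed to the filter expose the usual DOM properties `nodeType`, `nodeName`, and `parentNode`.

```javascript
// Hypothetical helper (not a Turndown API): returns true when `node` has an
// ancestor element whose tag name equals `tagName` (case-insensitive).
function hasAncestorTag(node, tagName) {
  const wanted = tagName.toUpperCase();
  // Start at the parent so the node itself never counts as its own context.
  let current = node ? node.parentNode : null;
  while (current != null) {
    // nodeType 1 is an element node; skip documents, fragments, etc.
    if (current.nodeType === 1 && String(current.nodeName).toUpperCase() === wanted) {
      return true;
    }
    current = current.parentNode;
  }
  return false;
}

// Filter function in the shape a rule's `filter` accepts:
// matches <p> elements that sit somewhere inside a <td>.
function isParagraphInCell(node) {
  return node.nodeName === 'P' && hasAncestorTag(node, 'td');
}
```

`isParagraphInCell` could be passed as the `filter` of the rule above in place of the inline function. Unlike `closest`, it only compares tag names, so it cannot express arbitrary CSS selectors.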

NL33 commented 3 years ago

Thanks for the info you provided. I will try it out.