I have a similar use case (#799) where a node is being removed because its class name contains the `header` keyword, which is matched by `REGEXPS.unlikelyCandidates`.
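To illustrate why such a node gets dropped, here is a rough sketch of the kind of check involved. The first pattern is an abbreviated, illustrative subset rather than the library's actual regex, and `isUnlikelyCandidate` is a hypothetical helper name. It assumes the match string is built from the node's class name and id.

```javascript
// Abbreviated, illustrative subset of an "unlikely candidates" pattern (assumption).
const unlikelyCandidates = /banner|comment|footer|header|menu|sidebar|sponsor/i;
// Names that rescue a node even if the pattern above matches (assumption).
const okMaybeItsACandidate = /and|article|body|column|content|main|shadow/i;

// Hypothetical helper: true if a node with this class/id would be discarded.
function isUnlikelyCandidate(className, id) {
  const matchString = className + " " + id;
  return (
    unlikelyCandidates.test(matchString) &&
    !okMaybeItsACandidate.test(matchString)
  );
}

console.log(isUnlikelyCandidate("post-header", ""));    // "header" matches
console.log(isUnlikelyCandidate("article-header", "")); // rescued by "article"
```

A class name only has to contain the keyword somewhere, which is why a wrapper that holds real content can be removed together with its headings.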
Of course, I could fork and adapt the regex. However, I think it would be better if there were a generic and dynamic way to influence the algorithm. For example, a callback that is invoked every time the algorithm is about to remove a node, something like this:
var article = new Readability(document, {
  // Proposed option (does not exist yet): return true to allow removal, false to keep the node
  onRemoveNode: (node) => {
    // Get all heading elements inside the node
    const headings = node.querySelectorAll("h1, h2, h3, h4, h5, h6");
    // Remove the node only if it doesn't contain any heading elements
    return headings.length === 0;
  }
}).parse();
This callback could be invoked directly from `_removeAndGetNext`:
https://github.com/mozilla/readability/blob/d64951b44bed21ca0f9e0113afd0761b0e0f9d05/Readability.js#L793-L797

For reference, the `REGEXPS.unlikelyCandidates` pattern mentioned above is defined here:
https://github.com/mozilla/readability/blob/d64951b44bed21ca0f9e0113afd0761b0e0f9d05/Readability.js#L122-L125
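A minimal sketch of how the hook might be wired in, under some assumptions: `FakeNode` is a stand-in so the example runs without a DOM, `removeAndGetNext` and `getNextNode` are simplified free functions rather than the library's methods, and `onRemoveNode` is the proposed (not yet existing) option. When the callback vetoes the removal, this sketch continues the traversal into the node's children, which is one possible design choice.

```javascript
// Minimal stand-in for a DOM element (assumption, for a DOM-free demo).
class FakeNode {
  constructor(tag, children = []) {
    this.tag = tag;
    this.parent = null;
    this.children = [];
    for (const child of children) {
      child.parent = this;
      this.children.push(child);
    }
  }
  get nextSibling() {
    if (!this.parent) return null;
    const index = this.parent.children.indexOf(this);
    return this.parent.children[index + 1] || null;
  }
  remove() {
    if (!this.parent) return;
    this.parent.children = this.parent.children.filter((c) => c !== this);
    this.parent = null;
  }
}

// Depth-first successor; with ignoreSelfAndKids the node's subtree is skipped.
function getNextNode(node, ignoreSelfAndKids) {
  if (!ignoreSelfAndKids && node.children.length > 0) return node.children[0];
  let current = node;
  while (current && !current.nextSibling) current = current.parent;
  return current ? current.nextSibling : null;
}

// Hypothetical variant: ask the callback before removing the node.
function removeAndGetNext(node, options = {}) {
  const vetoed =
    typeof options.onRemoveNode === "function" &&
    options.onRemoveNode(node) === false;
  if (vetoed) {
    // Keep the node and continue with its children.
    return getNextNode(node, false);
  }
  const next = getNextNode(node, true); // compute before detaching the node
  node.remove();
  return next;
}

// Helper for the demo: does the subtree contain a heading?
function hasHeading(node) {
  const tags = ["h1", "h2", "h3", "h4", "h5", "h6"];
  return node.children.some((c) => tags.includes(c.tag) || hasHeading(c));
}

// Usage: keep wrappers that contain headings, remove everything else.
const header = new FakeNode("div", [new FakeNode("h1")]);
const ad = new FakeNode("div", [new FakeNode("span")]);
const body = new FakeNode("body", [header, ad, new FakeNode("p")]);
const options = { onRemoveNode: (node) => !hasHeading(node) };

removeAndGetNext(header, options); // vetoed, header stays
removeAndGetNext(ad, options);     // allowed, ad is removed
console.log(body.children.map((c) => c.tag));
```

Without an `onRemoveNode` option the function behaves like a plain remove-and-advance, so existing callers would not be affected.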
If there is any interest in this, I'd be willing to submit a PR.