mozilla / readability

A standalone version of the readability lib

Feature: Callback like `onRemoveNode` before a node is being removed #856

Open zirkelc opened 6 months ago

zirkelc commented 6 months ago

I have a use case similar to #799: a node is being removed because its class name contains the `header` keyword, which is matched by `REGEXPS.unlikelyCandidates`:

https://github.com/mozilla/readability/blob/d64951b44bed21ca0f9e0113afd0761b0e0f9d05/Readability.js#L122-L125
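To illustrate why such a node gets flagged, here is a minimal sketch. The pattern below is an abbreviated stand-in rather than the full `REGEXPS.unlikelyCandidates`, and `looksUnlikely` is a hypothetical helper that assumes the class name and id are tested together as one string:

```javascript
// Abbreviated, illustrative pattern. The real REGEXPS.unlikelyCandidates
// lists many more keywords; this subset is an assumption for the example.
const unlikelyCandidates = /banner|comment|footer|header|menu|sidebar|sponsor/i;

// Hypothetical helper: checks class name and id as a single string.
function looksUnlikely(className, id = "") {
  return unlikelyCandidates.test(className + " " + id);
}

console.log(looksUnlikely("post-header")); // true, contains "header"
console.log(looksUnlikely("post-body"));   // false, no keyword matches
```

Any class name that merely contains `header` is flagged, even when the element wraps real content such as the article title.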

Of course, I could fork the library and adapt the regex. However, I think a generic, dynamic way to influence the algorithm would be better. For example, a callback that is invoked every time the algorithm is about to remove a node, and whose return value decides whether the removal happens, something like this:

var article = new Readability(document, {
    onRemoveNode: (node) => {
        // `this` is not the Readability instance inside an arrow function,
        // so query the node directly for all heading elements
        const headings = node.querySelectorAll("h1, h2, h3, h4, h5, h6");

        // remove the node only if it doesn't contain any heading elements
        return headings.length === 0;
    }
}).parse();

This callback could be invoked directly from _removeAndGetNext:

https://github.com/mozilla/readability/blob/d64951b44bed21ca0f9e0113afd0761b0e0f9d05/Readability.js#L793-L797
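A rough sketch of how that hook might look, not the actual Readability.js source. `TinyNode` is a hypothetical stand-in for a DOM element so the example runs without a browser, `getNextSkippingSubtree` approximates the traversal step, and treating a `false` return value as a veto is an assumption based on the proposal above:

```javascript
// Hypothetical stand-in for a DOM element (not part of Readability.js).
class TinyNode {
  constructor(tagName, className = "") {
    this.tagName = tagName;
    this.className = className;
    this.parentNode = null;
    this.children = [];
  }
  appendChild(child) {
    child.parentNode = this;
    this.children.push(child);
    return child;
  }
  removeChild(child) {
    this.children = this.children.filter((c) => c !== child);
    child.parentNode = null;
    return child;
  }
  get nextElementSibling() {
    if (!this.parentNode) return null;
    const siblings = this.parentNode.children;
    return siblings[siblings.indexOf(this) + 1] || null;
  }
}

// Depth-first successor of `node`, skipping its own subtree.
function getNextSkippingSubtree(node) {
  let current = node;
  while (current && !current.nextElementSibling) {
    current = current.parentNode;
  }
  return current ? current.nextElementSibling : null;
}

// Removes `node` unless the (assumed) onRemoveNode callback returns false.
function removeAndGetNext(node, options = {}) {
  // Compute the successor first: removal detaches the node from its parent.
  const nextNode = getNextSkippingSubtree(node);
  const vetoed =
    typeof options.onRemoveNode === "function" &&
    options.onRemoveNode(node) === false;
  if (!vetoed) {
    node.parentNode.removeChild(node);
  }
  return nextNode;
}

// Usage: keep nodes that directly contain a heading, remove everything else.
const body = new TinyNode("body");
const header = body.appendChild(new TinyNode("div", "header"));
header.appendChild(new TinyNode("h1"));
const sidebar = body.appendChild(new TinyNode("div", "sidebar"));

const options = {
  onRemoveNode: (node) => !node.children.some((c) => /^h[1-6]$/.test(c.tagName)),
};

const afterHeader = removeAndGetNext(header, options);   // vetoed, header stays
const afterSidebar = removeAndGetNext(sidebar, options); // sidebar is removed
```

In this sketch a vetoed node is skipped together with its subtree. A real implementation would need to decide whether traversal should instead continue into the children of a node that was kept.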

If there is any interest in this, I'd be willing to submit a PR.

cmkm commented 5 months ago

This seems like a good idea, please do submit a PR!