j0k3r / php-readability

A fork of https://bitbucket.org/fivefilters/php-readability
Apache License 2.0

Readability removes headings when they have a link in them #85

Open kolaente opened 8 months ago

kolaente commented 8 months ago

It looks like Readability removes headlines like this:

<h2 class="header-with-anchor-widget">1. A mental model of the software engineering cycle
    <div id="§a-mental-model-of-the-software-engineering-cycle" class="header-anchor-widget offset-top">
        <div class="header-anchor-widget-button-container">
            <div class="header-anchor-widget-button" href="https://newsletter.pragmaticengineer.com/i/136465585/a-mental-model-of-the-software-engineering-cycle"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="header-anchor-widget-icon"><path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07l-1.72 1.71"></path><path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 0 0 7.07 7.07l1.71-1.71"></path></svg></div>
        </div>
    </div>
</h2>

This is a valid headline (as in, we want to preserve it as content); it just contains an extra <div>. The headline seems to be removed in https://github.com/j0k3r/php-readability/blob/38870cdff150e5d50958c721f65615d22472d1fd/src/Readability.php#L900.

Is there a way to control this behaviour? Ideally, Readability would keep the headline but remove the extra div.
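
Until there is a way to control this, one possible workaround is to strip the widget before the HTML reaches Readability. The following is only a sketch under a few assumptions: it uses PHP's built-in DOM extension, not any php-readability API; `stripHeaderAnchorWidgets` is a hypothetical helper name; and it assumes the widget is always a <div> carrying the `header-anchor-widget` class inside a heading, as in the markup above. Character-encoding handling is deliberately left out, so non-ASCII input may need extra care.

```php
<?php
// Hypothetical pre-processing helper (not part of php-readability):
// removes Substack-style anchor widgets from headings so that the
// headings no longer contain a block-level <div>.
function stripHeaderAnchorWidgets(string $html): string
{
    $doc = new DOMDocument();

    // Silence libxml warnings about tags it does not know (e.g. <svg>).
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match <div> elements whose class list contains the exact token
    // "header-anchor-widget" and that sit anywhere inside h1..h6.
    $query = '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]'
        . '//div[contains(concat(" ", normalize-space(@class), " "), " header-anchor-widget ")]';

    // Copy the node list first so removing nodes does not disturb iteration.
    foreach (iterator_to_array($xpath->query($query)) as $widget) {
        if ($widget->parentNode !== null) {
            $widget->parentNode->removeChild($widget);
        }
    }

    return $doc->saveHTML();
}
```

The cleaned HTML can then be handed to Readability as usual. This only sidesteps the problem for this specific markup; it does not change how Readability treats headings that contain other elements.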

The markup is taken from this article. It looks like all Substack publications use the same markup.