j0k3r / graby

Graby helps you extract article content from web pages
MIT License

WordPress lazy-loading noscript cleaner broken with libxml2 < 2.9.9 #240

Open jtojnar opened 3 years ago

jtojnar commented 3 years ago

With libxml2 2.9.4 (included in Ubuntu 18.04 LTS), Graby’s WordPress lazy-loading noscript cleaner is unable to remove the second image in the noscript text:

<p><img data-lazyloaded="1" src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI2MzkiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNjM5IDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==" class="aligncenter size-full wp-image-32079" data-src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /><noscript><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></noscript></p>

is turned into:

<p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" class="aligncenter size-full wp-image-32079" alt="" width="639" height="408" /></p><noscript>
<p><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></p>
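The symptom above (the same image appearing twice in the cleaned output) can be detected mechanically. This is a minimal sketch, not part of Graby: `duplicateImageSources` is a hypothetical helper, and the `example.com` URLs are placeholders. It uses a regular expression rather than `DOMDocument`, so the check does not itself depend on the libxml2 version under suspicion.

```php
<?php
// Hypothetical helper (not part of Graby): list every img "src" value that
// occurs more than once in an HTML string.
function duplicateImageSources(string $html): array
{
    // "\ssrc=" requires whitespace before "src", so "data-src" is not matched.
    preg_match_all('/<img\b[^>]*\ssrc="([^"]*)"/i', $html, $matches);

    // Count occurrences per URL and keep only those seen more than once.
    $counts = array_count_values($matches[1]);
    $repeated = array_filter($counts, function ($n) {
        return $n > 1;
    });

    return array_keys($repeated);
}

// Placeholder markup shaped like the broken output shown above.
$broken = '<p><img src="https://example.com/a.png" /></p><noscript>'
    . '<p><img src="https://example.com/a.png" /></p>';

print_r(duplicateImageSources($broken));
```

An empty array means no image source is repeated; any entries point at leftover noscript fallbacks.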

It works fine with libxml2 2.9.10 in later versions of Ubuntu; it was likely fixed by https://gitlab.gnome.org/GNOME/libxml2/-/commit/35e83488505d501864826125cfe6a7950d6cba78.
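To check whether a given PHP installation is likely affected, the libxml2 version can be compared against the threshold from this issue's title. This is a rough sketch: `isAffectedLibxml` is a hypothetical helper, the 2.9.9 cutoff is taken from the title rather than verified against the libxml2 changelog, and `LIBXML_DOTTED_VERSION` reports the version PHP was built against, which may differ from the library loaded at runtime.

```php
<?php
// Hypothetical helper: true when the version predates 2.9.9, the cutoff
// assumed from this issue's title.
function isAffectedLibxml(string $version): bool
{
    // version_compare() compares each dotted component numerically,
    // so 2.9.10 is correctly treated as newer than 2.9.9.
    return version_compare($version, '2.9.9', '<');
}

printf(
    "libxml2 %s: %s\n",
    LIBXML_DOTTED_VERSION,
    isAffectedLibxml(LIBXML_DOTTED_VERSION) ? 'likely affected' : 'likely fine'
);
```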

You can reproduce this by running

$ git clone https://github.com/jtojnar/graby-double-images && cd graby-double-images
$ composer install
$ php test.php

on a system with libxml2 older than 2.9.9, or, if you have Nix:

$ nix-shell --run 'composer install && php test.php'

See https://github.com/fossar/selfoss/issues/1230 for more details.

jtojnar commented 3 years ago

At this point I see these possible solutions:

jtojnar commented 3 years ago

There is also a separate bug in tidy that wraps the img inside the noscript in a p, resulting in invalid p > noscript > p nesting, but that does not seem to cause issues, thanks to another libxml2 bug :woman_shrugging:

jtojnar commented 3 years ago

Apparently, html5lib suffers from this even more, even with https://github.com/j0k3r/php-readability/pull/60. I thought it might use libxml2 internally, but the problem occurs on libxml2 2.9.10 as well:

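use Graby\Graby;
// GuzzleAdapter is presumably an alias for an HTTPlug Guzzle adapter class imported elsewhere.

$graby = new Graby([
    'extractor' => [
        'default_parser' => 'html5lib',
        'allowed_parsers' => ['html5lib'], // Without this it would still use libxml
    ],
], new GuzzleAdapter());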