WordPress lazy-loading noscript cleaner broken with libxml2 < 2.9.9

jtojnar commented 3 years ago

With libxml2 2.9.4 (included in Ubuntu 18.04 LTS), Graby’s WordPress lazy-loading noscript cleaner is unable to remove the second image in the noscript text:

<p><img data-lazyloaded="1" src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI2MzkiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNjM5IDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==" class="aligncenter size-full wp-image-32079" data-src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /><noscript><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></noscript></p>

is turned into:

<p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" class="aligncenter size-full wp-image-32079" alt="" width="639" height="408" /></p><noscript>
<p><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></p>

It works fine with libxml2 2.9.10 in later versions of Ubuntu; it was likely fixed by https://gitlab.gnome.org/GNOME/libxml2/-/commit/35e83488505d501864826125cfe6a7950d6cba78.

You can reproduce this by running

$ git clone https://github.com/jtojnar/graby-double-images && cd graby-double-images
$ composer install
$ php test.php

on a system with libxml2 older than 2.9.9 or, if you have Nix, by running:

$ nix-shell --run 'composer install && php test.php'
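
To tell whether a given PHP installation is likely to hit this, one can check the libxml2 version PHP was built against. This is only a rough sketch: the 2.9.9 cutoff is taken from this issue's title and the commit linked above, and distributions may backport the fix into older version numbers.

```php
<?php
// LIBXML_DOTTED_VERSION is a constant defined by PHP's libxml extension.
// Assumption: versions older than 2.9.9 are affected (cutoff from this issue).
$affected = version_compare(LIBXML_DOTTED_VERSION, '2.9.9', '<');

printf(
    "libxml2 %s: %s\n",
    LIBXML_DOTTED_VERSION,
    $affected ? 'probably affected' : 'probably not affected'
);
```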

See https://github.com/fossar/selfoss/issues/1230 for more details.

jtojnar commented 3 years ago

At this point I see these possible solutions:

Recommend using html5lib instead of libxml, though I am not sure how performant it is.
Try to find out whether it is possible to make libxml parse the noscript inside p correctly.
Make the ContentExtractor look for the noscript among the parent node's siblings as well.
Ask Ubuntu and other distros to backport the patch, since it is trivial.
Do nothing and ask users to upgrade. But Ubuntu 18.04 is supported at least until April 2023 :crying_cat_face:
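
The third option could look roughly like the sketch below. It is not Graby's actual ContentExtractor code: `removeLazyLoadFallbacks` is a hypothetical helper, and the sample tree is built by hand with the XML parser on the assumption that older libxml2 leaves the noscript as a sibling of the p that holds the lazy-loaded img.

```php
<?php
/**
 * Hypothetical helper (not part of Graby): for every lazy-loaded <img>,
 * remove its <noscript> fallback, looking first at the image's next element
 * sibling and then at the next element sibling of the image's parent.
 * Returns the number of <noscript> elements removed.
 */
function removeLazyLoadFallbacks(DOMDocument $doc): int
{
    $xpath = new DOMXPath($doc);
    $removed = 0;

    // Snapshot the matches so removing nodes does not disturb the iteration.
    $images = iterator_to_array($xpath->query('//img[@data-lazyloaded]'));

    foreach ($images as $img) {
        // Usual case: <img data-lazyloaded><noscript>...</noscript>
        $noscript = $xpath->query('following-sibling::*[1][self::noscript]', $img)->item(0)
            // Assumed mis-parsed case: <p><img data-lazyloaded></p><noscript>...</noscript>
            ?? $xpath->query('parent::*/following-sibling::*[1][self::noscript]', $img)->item(0);

        if ($noscript !== null && $noscript->parentNode !== null) {
            $noscript->parentNode->removeChild($noscript);
            ++$removed;
        }
    }

    return $removed;
}

// Hand-built tree mimicking the assumed mis-parsed shape.
$doc = new DOMDocument();
$doc->loadXML(
    '<div><p><img data-lazyloaded="1" src="a.png"/></p>'
    . '<noscript><p><img src="a.png"/></p></noscript></div>'
);

$removed = removeLazyLoadFallbacks($doc);
$remainingImages = $doc->getElementsByTagName('img')->length;
echo $removed, ' noscript removed, ', $remainingImages, " img left\n";
```

Checking only the immediately following element keeps the lookup from removing an unrelated noscript further down the page, at the cost of missing fallbacks that end up separated from the image by other elements.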

jtojnar commented 3 years ago

There is also a separate bug in tidy that wraps the img inside the noscript in a p, resulting in invalid p > noscript > p nesting, but that does not seem to cause issues thanks to another libxml2 bug :woman_shrugging:

jtojnar commented 3 years ago

Apparently, html5lib is affected even more badly, even with https://github.com/j0k3r/php-readability/pull/60. I thought it might use libxml2 internally, but the problem occurs on libxml2 2.9.10 as well:

$graby = new Graby([
    'extractor' => [
        'default_parser' => 'html5lib',
        'allowed_parsers' => ['html5lib'], // Without this it would still use libxml
    ]
], new GuzzleAdapter());

j0k3r / graby

WordPress lazy-loading noscript cleaner broken with libxml2 < 2.9.9 #240