Open jtojnar opened 3 years ago
At this point I see these possible solutions:

- Use `html5lib` instead of libxml, but I am not sure how performant it is.
- Handle `noscript` inside `p` correctly.
- Make `ContentExtractor` look for `noscript` in the parent node’s sibling as well.

There is also a separate bug in tidy that wraps the `img` in the `noscript` in a `p`, resulting in invalid `p > noscript > p` nesting, but that does not seem to cause issues thanks to another libxml2 bug :woman_shrugging:
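The `ContentExtractor` option could look roughly like the sketch below. It uses only PHP's DOM extension; `findFallbackNoscript` is a hypothetical helper and not Graby's actual code, and the sample markup is only an assumed shape of the mis-parsed tree, with the `noscript` ending up next to the image's parent rather than next to the image itself.

```php
<?php
// Hypothetical helper, not part of Graby: find the <noscript> fallback that
// belongs to a lazy-loaded image.
function findFallbackNoscript(DOMElement $img): ?DOMElement
{
    // Look after the image itself first, then after its parent node.
    foreach ([$img, $img->parentNode] as $start) {
        if (!$start instanceof DOMNode) {
            continue;
        }
        for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node instanceof DOMElement) {
                if (strtolower($node->nodeName) === 'noscript') {
                    return $node;
                }
                break; // Only the nearest element sibling is considered.
            }
        }
    }

    return null;
}

// Assumed tree shape: the <noscript> is a sibling of the image's parent <p>.
$doc = new DOMDocument();
$doc->loadXML(
    '<div><p><img class="lazy" src="placeholder.gif"/></p>'
    . '<noscript><img src="real.jpg"/></noscript></div>'
);

$img = $doc->getElementsByTagName('img')->item(0);
$noscript = findFallbackNoscript($img);
```

Stopping at the nearest element sibling is a deliberate choice in this sketch, so that a `noscript` belonging to some later image is not picked up by mistake.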
Apparently, html5lib suffers from this even worse, even with https://github.com/j0k3r/php-readability/pull/60. I thought it might use libxml2 internally, but it happens on libxml2 2.9.10 as well:

    $graby = new Graby([
        'extractor' => [
            'default_parser' => 'html5lib',
            'allowed_parsers' => ['html5lib'], // Without this it would still use libxml
        ]
    ], new GuzzleAdapter());

With libxml2 2.9.4 (included in Ubuntu 18.04 LTS), Graby’s WordPress lazy-loading noscript cleaner is unable to remove the second image in the noscript text:
is turned into:
It works fine with libxml2 2.9.10 in later versions of Ubuntu; it was likely fixed by https://gitlab.gnome.org/GNOME/libxml2/-/commit/35e83488505d501864826125cfe6a7950d6cba78.
You can reproduce this by running
on a system with libxml2 older than 2.9.9 or, if you have Nix:
See https://github.com/fossar/selfoss/issues/1230 for more details.
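To tell whether a given PHP installation is likely affected, one can compare the libxml2 version PHP is linked against with the 2.9.9 threshold mentioned above. This is a minimal sketch: `isLibxmlOlderThan` is a hypothetical helper, and the threshold is taken from this issue rather than verified against libxml2's changelog.

```php
<?php
// Hypothetical helper: true when the given libxml2 version is older than
// the threshold, using PHP's built-in version_compare().
function isLibxmlOlderThan(string $linked, string $threshold): bool
{
    return version_compare($linked, $threshold, '<');
}

// LIBXML_DOTTED_VERSION is a constant provided by PHP's libxml extension.
$linked = LIBXML_DOTTED_VERSION;

printf(
    "libxml2 %s: %s\n",
    $linked,
    isLibxmlOlderThan($linked, '2.9.9') ? 'likely affected' : 'likely not affected'
);
```

`version_compare` compares the numeric parts as integers, so 2.9.10 is correctly treated as newer than 2.9.9.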