scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License
846 stars 113 forks source link

fix: Add Support for lxml >= 5.2.0 #217

Closed michael-genson closed 6 months ago

michael-genson commented 6 months ago

lxml 5.1.0 removed _ElementStringResult which we use during dom traversal. This PR attempts to import _ElementStringResult as usual, but falls back to defining it explicitly if the import fails. This fixes https://github.com/scrapinghub/extruct/issues/215

I chose to attempt to import first to preserve existing behavior if an older version of lxml is installed.

Additionally, we use the lxml HTML cleaner, which is now in its own separate package. Due to this I've added the optional html_clean package to the requirements.txt. As noted below, this raises a warning if using lxml < 5.2.0, but otherwise doesn't impact older versions.

michael-genson commented 6 months ago

Looks like the test failures are related to mf2py, not lxml. Looking into whether we should update the tests (probably) or pin mf2py < 2 (probably not)

michael-genson commented 6 months ago

mf2py released v2 a few months ago which has a single breaking change: it includes image alts in microdata by default (see changelog in release). The reasoning for this is in this mf2py issue, but TL;DR the image alt has been part of the microformat spec for several years, and not including it was considered deprecated behavior (and now is removed entirely).

To accommodate this I've updated the tests, though it's not explicitly related to the lxml update.