mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Revision Needed: Canonical URL extraction method will break with future versions of trafilatura #95

Closed m453h closed 1 month ago

m453h commented 1 month ago

The current approach for extracting canonical URLs with trafilatura relies on the behaviour of the library version we currently use (1.8.*). That version extracts metadata from HTML pages automatically, regardless of whether the with_metadata parameter is specified.

From trafilatura version 1.11.* onward, metadata extraction is skipped by default, which would prevent us from automatically extracting canonical URLs. We therefore need to explore a different approach for the extractor before any future upgrade.
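
One possible direction, sketched here only as an illustration and not as the approach the project adopted, is to read the `<link rel="canonical">` element straight from the HTML so the result does not depend on trafilatura's default metadata behaviour. The names `CanonicalLinkParser` and `find_canonical_url` are hypothetical and use only the Python standard library. Explicitly requesting metadata via with_metadata would be another route to evaluate.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Hypothetical helper: records the href of the first <link rel="canonical">."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and we keep the first match only.
        if tag != "link" or self.canonical_url is not None:
            return
        # Attribute values may be None (bare attributes), so normalise to "".
        attr_map = {name: (value or "") for name, value in attrs}
        # rel can hold several space-separated tokens and is case-insensitive.
        rel_tokens = attr_map.get("rel", "").lower().split()
        href = attr_map.get("href", "").strip()
        if "canonical" in rel_tokens and href:
            self.canonical_url = href


def find_canonical_url(html_text: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical_url


if __name__ == "__main__":
    sample = (
        '<html><head><link rel="canonical" '
        'href="https://example.com/story"/></head><body></body></html>'
    )
    print(find_canonical_url(sample))
```

This sketch returns the href as written in the page; resolving relative URLs against the page address and falling back to other hints (such as og:url) are left out and would need to be decided separately.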

philbudne commented 1 month ago

Maybe we should add a comment/warning in requirements.txt about trafilatura versions!

m453h commented 1 month ago

Sure, I will do the needful!