The current approach for extracting canonical URLs using trafilatura relies on the behaviour of the version of the library we are currently using (1.8.*). This version automatically extracts metadata from HTML pages regardless of whether the with_metadata parameter is specified.
From trafilatura version 1.11.* onwards, metadata extraction is skipped by default, which would prevent us from automatically extracting canonical URLs. We therefore need to explore a different approach for the extractor before any future upgrade.
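One direction that could be explored is reading the canonical link directly from the HTML, so the result no longer depends on trafilatura's metadata defaults. The following is a minimal sketch using only the Python standard library; the names CanonicalLinkParser and extract_canonical_url are hypothetical and not part of trafilatura or of the existing extractor.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only until a canonical URL is found.
        if tag != "link" or self.canonical_url is not None:
            return
        # Attribute values can be None (e.g. <link rel>), so normalise to "".
        attr_map = {name: (value or "") for name, value in attrs}
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = attr_map.get("rel", "").lower().split()
        href = attr_map.get("href", "").strip()
        if "canonical" in rel_tokens and href:
            self.canonical_url = href


def extract_canonical_url(html: str) -> Optional[str]:
    """Hypothetical helper: return the canonical URL declared in html, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    parser.close()
    return parser.canonical_url
```

This sketch returns the href exactly as written in the page. If relative canonical links need to be supported, the value would have to be resolved against the page URL, for example with urllib.parse.urljoin.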