Closed philbudne closed 1 month ago
Once the new data is available, the only change needed for the story-indexer pipeline to save it is adding a field of the same name to the Story ContentMetadata sub-object.
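To illustrate why a same-named field would be enough, here is a rough sketch. It assumes the parser copies extracted values into the metadata object by matching key names; `ContentMetadataSketch` and `copy_matching_fields` are hypothetical stand-ins, not the actual story-indexer code.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ContentMetadataSketch:
    """Hypothetical stand-in for the Story ContentMetadata sub-object."""
    url: Optional[str] = None
    canonical_url: Optional[str] = None  # the newly added field


def copy_matching_fields(extracted: Dict[str, Any]) -> ContentMetadataSketch:
    """Keep only the extracted keys that have a same-named dataclass field."""
    names = {f.name for f in fields(ContentMetadataSketch)}
    return ContentMetadataSketch(
        **{key: value for key, value in extracted.items() if key in names}
    )
```

Under that assumption, any key the extractor returns without a matching field is silently dropped, so adding the field is the only step needed for the value to be saved.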
Doing a little digging into what is available in the extractor libraries we use (in the order we prefer them):
include_metadata = True, which we do NOT (and we may not want to, since it looks like it would change what stories are accepted).
Article class has canonical_link property (*)
Article class has canonical_link member (*)
(*) I haven't done any actual examination to see if it works.
In practice, we only RARELY call goose, and don't often call boilerpipe.
The goal here would be to extract canonical URL values from source or parsed HTML. This might include tags that are children of the HTML <head> element, like:
<link rel="canonical" href="https://example.com/dresses/green-dresses" /> (via Google spec)
<meta property="og:url" content="https://example.com/path" /> (via Meta spec)
Since this would require a re-parse or slow-down, one quick-and-dirty approach would be to just include canonical_url as one of the items returned when you call extract with include_other_metadata=True:
https://github.com/mediacloud/metadata-lib/blob/fa7d2f64e4d0649f33c15a99044fd955becbdef8/mcmetadata/__init__.py#L21-L23
That code passes include_metadata into the extractor, so it could then be handled on an extractor-by-extractor basis to return the extracted value (or we could extract it ourselves). Trafilatura's source has a good reference implementation of this that operates on an lxml.html.HtmlElement tree, so it could perhaps be repurposed via a direct call to that method.
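For the "extract it ourselves" option, a minimal sketch follows. It uses the standard library's html.parser rather than lxml or Trafilatura's implementation, so it is a stand-in for the idea, not a copy of either. The names `CanonicalURLParser` and `find_canonical_url` are made up for this example, and preferring the link tag over og:url is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalURLParser(HTMLParser):
    """Collect canonical URL candidates declared inside <head>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.link_canonical: Optional[str] = None
        self.og_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        if not self.in_head:
            return  # ignore tags outside <head>
        attr = dict(attrs)
        if tag == "link" and self.link_canonical is None:
            # rel may hold several space-separated tokens
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels and attr.get("href"):
                self.link_canonical = attr["href"].strip()
        elif tag == "meta" and self.og_url is None:
            is_og_url = (attr.get("property") or "").lower() == "og:url"
            if is_og_url and attr.get("content"):
                self.og_url = attr["content"].strip()

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical_url(html_text: str) -> Optional[str]:
    """Return the first <link rel="canonical"> href, else og:url, else None."""
    parser = CanonicalURLParser()
    parser.feed(html_text)
    parser.close()
    return parser.link_canonical or parser.og_url
```

This does a second pass over the HTML, which is the re-parse cost mentioned above; reusing the tree an extractor has already built would avoid that.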
As discussed, we'd like to investigate using page-declared canonical URLs (<link rel="canonical" href="URL">), especially when processing "historical" data that lacks the final/redirected URL, and to have that (meta)data returned to the story-indexer parser by mcmetadata.extract.
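As a possible illustration of how the canonical value might be used for historical data, the sketch below picks a URL for a story. The function name `choose_story_url` and the preference order (final URL, then canonical, then original) are assumptions made for this example, not an agreed policy.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def choose_story_url(
    original_url: str,
    final_url: Optional[str],
    canonical_href: Optional[str],
) -> str:
    """Pick a URL: final/redirected if known, else canonical, else original."""
    if final_url:
        return final_url
    if canonical_href:
        # The href may be relative, so resolve it against the original URL.
        candidate = urljoin(original_url, canonical_href.strip())
        parts = urlsplit(candidate)
        # Only accept absolute http(s) URLs; ignore things like javascript:
        if parts.scheme in ("http", "https") and parts.netloc:
            return candidate
    return original_url
```

For example, `choose_story_url("http://example.com/a?utm=1", None, "/a")` resolves the relative href and returns `"http://example.com/a"`.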