Include "canonical link" information in mcmetadata.extract if present.

philbudne commented 2 months ago

As discussed, we'd like to investigate using page-declared canonical URLs (<link rel="canonical" href="URL">), especially when processing "historical" data that lacks the final/redirected URL, and to have that (meta)data returned to the story-indexer parser by mcmetadata.extract.

philbudne commented 2 months ago

Once the new data is available, all that would be needed to have it saved by the story-indexer pipeline would be to add a field of the same name to the Story ContentMetadata sub-object.
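
A rough sketch of what that could look like, assuming the sub-object is a dataclass and the parser copies keys from the extract result by name. The class below is a simplified stand-in rather than the real story-indexer definition, and both canonical_url and copy_extracted are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentMetadata:
    """Simplified stand-in for the story-indexer sub-object (not the real class)."""

    url: Optional[str] = None
    # Hypothetical new field; its name must match the key returned by extract.
    canonical_url: Optional[str] = None


def copy_extracted(meta: ContentMetadata, extracted: dict) -> ContentMetadata:
    """Copy each key of the extract result that has a same-named field."""
    for key, value in extracted.items():
        if hasattr(meta, key):
            setattr(meta, key, value)
    return meta
```

Keys without a matching field are ignored, so the new value would only be saved once the field exists.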

philbudne commented 2 months ago

Doing a little digging into what is available in the extractor libraries we use (in the order we prefer them):

trafilatura: might return it, IF we called mcmetadata.extract with include_metadata = True, which we do NOT (and we may not want to, since it looks like it would change what stories are accepted).
readability: "canonical" does NOT appear in library
boilerpy: "canonical" does NOT appear in library
goose: Article class has canonical_link property (*)
newspaper: Article class has canonical_link member (*)
bs4: we get document tree, should be easy to look for link node
lxml: should be easy to look for node?

(*) I haven't done any actual examination to see if it works.

In practice, we only RARELY call goose, and don't often call boilerpipe.
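
For the bs4 option, the lookup could be as small as the sketch below. It is untested against real pages, and find_canonical_link is a hypothetical helper name, not something in mcmetadata:

```python
from typing import Optional

from bs4 import BeautifulSoup


def find_canonical_link(html: str) -> Optional[str]:
    """Return the href of the first <link rel="canonical"> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("link", href=True):
        # bs4 parses rel as a list of tokens, e.g. ["canonical"].
        rel = [token.lower() for token in (link.get("rel") or [])]
        if "canonical" in rel:
            return link["href"].strip() or None
    return None
```

Since we would already have the document tree in the bs4 case, the same loop could run on the existing soup object instead of re-parsing the HTML.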

rahulbot commented 2 months ago

The goal here would be to extract canonical URL values from source or parsed HTML. This might include tags that are children of HTML <head> like:

<link rel="canonical" href="https://example.com/dresses/green-dresses" /> (via Google spec)
<meta property="og:url" content="https://example.com/path" /> (via Meta spec)

Since this would require a re-parse or slow-down, one quick-and-dirty approach would be to just include canonical_url as one of the items returned when you call extract with include_other_metadata=True: https://github.com/mediacloud/metadata-lib/blob/fa7d2f64e4d0649f33c15a99044fd955becbdef8/mcmetadata/__init__.py#L21-L23

That code passes include_metadata into the extractor, so then it could be handled on an extractor-by-extractor basis to return the extracted value (or extract it ourselves). Trafilatura's source has a good reference implementation of this that operates on an lxml.html.HtmlElement tree, so it could perhaps be repurposed via a direct call to that method.
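
If we end up extracting it ourselves, a minimal sketch on an lxml tree might look like the following. This is not Trafilatura's implementation; canonical_from_tree and other_metadata are hypothetical names, and preferring the link tag over og:url is an assumption:

```python
from typing import Optional

import lxml.html


def canonical_from_tree(tree: lxml.html.HtmlElement) -> Optional[str]:
    """Look for a canonical URL in an already-parsed document tree."""
    # Assumed preference order: <link rel="canonical">, then og:url.
    # The XPath match is exact, so rel="Canonical" or a multi-token
    # rel value would be missed by this simple version.
    for xpath in (
        '//link[@rel="canonical"]/@href',
        '//meta[@property="og:url"]/@content',
    ):
        for value in tree.xpath(xpath):
            value = value.strip()
            if value:
                return value
    return None


def other_metadata(html_text: str, include_other_metadata: bool = False) -> dict:
    """Hypothetical helper: build the extra-metadata dict only when asked to."""
    if not include_other_metadata:
        return {}
    # This re-parses the page; reusing an extractor's tree would avoid that.
    tree = lxml.html.document_fromstring(html_text)
    return {"canonical_url": canonical_from_tree(tree)}
```

Keeping the lookup in a function that takes a tree means an extractor that already holds an lxml document could call it directly, and the re-parse would only happen for extractors that do not.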

mediacloud / metadata-lib

Include "canonical link" information in mcmetadata.extract if present. #89