mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Include "canonical link" information in mcmetadata.extract if present. #89

Closed philbudne closed 1 month ago

philbudne commented 2 months ago

As discussed, we'd like to investigate using page-declared canonical URLs (<link rel="canonical" href="URL">), especially when processing "historical" data that lacks the final/redirected URL, and to have that (meta)data returned to the story-indexer parser by mcmetadata.extract.

philbudne commented 2 months ago

Once the new data is available, all that would be needed to have it saved by the story-indexer pipeline would be to add a field of the same name to the Story ContentMetadata sub-object.
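
A rough sketch of that idea, for illustration only. `ContentMetadataSketch` and `from_extract_result` are hypothetical stand-ins, not the real story-indexer classes, and the field name `canonical_url` is an assumption about what mcmetadata.extract would return:

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ContentMetadataSketch:
    """Hypothetical stand-in for the Story ContentMetadata sub-object."""
    url: Optional[str] = None
    # New field, named identically to the (assumed) key in the extract result.
    canonical_url: Optional[str] = None


def from_extract_result(result: Dict[str, Any]) -> ContentMetadataSketch:
    """Copy only the keys that match a declared field; ignore the rest."""
    known = {f.name for f in fields(ContentMetadataSketch)}
    return ContentMetadataSketch(**{k: v for k, v in result.items() if k in known})
```

Because the copy is driven by field names, a key that has no matching field is silently dropped, which is why adding a same-named field is the only change needed in this sketch.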

philbudne commented 2 months ago

Doing a little digging into what is available in the extractor libraries we use (in the order we prefer them):

(*) I haven't done any actual examination to see if it works.

In practice, we only RARELY call goose, and don't often call boilerpipe.

rahulbot commented 2 months ago

The goal here would be to extract canonical URL values from source or parsed HTML. This might include tags that are children of HTML <head> like:

Since this would require a re-parse or slow-down, one quick-and-dirty approach would be to just include canonical_url as one of the items returned when you call extract with include_other_metadata=True: https://github.com/mediacloud/metadata-lib/blob/fa7d2f64e4d0649f33c15a99044fd955becbdef8/mcmetadata/__init__.py#L21-L23

That code passes include_metadata into the extractor, so then it could be handled on an extractor-by-extractor basis to return the extracted value (or extract it ourselves). Trafilatura's source has a good reference implementation of this that operates on an lxml.html.HtmlElement tree, so could perhaps be repurposed via a direct call to that method.
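
For the "extract it ourselves" option, here is a minimal sketch of pulling a canonical URL out of raw HTML. It is not Trafilatura's implementation and not existing mcmetadata code: it uses the standard-library `html.parser` instead of lxml to stay dependency-free, and the names `_CanonicalLinkParser` and `find_canonical_url` are made up for this example. It only looks at `<link rel="canonical">`; other tags would need their own handling.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _CanonicalLinkParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical_href is not None:
            return
        attr_map = dict(attrs)
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "canonical" in rel_tokens and href:
            self.canonical_href = href.strip()


def find_canonical_url(html_text: str, page_url: str = "") -> Optional[str]:
    """Return the page-declared canonical URL, or None if there is none.

    A relative href is resolved against page_url (hypothetical parameter).
    """
    parser = _CanonicalLinkParser()
    parser.feed(html_text)
    if parser.canonical_href is None:
        return None
    return urljoin(page_url, parser.canonical_href)
```

Example use: `find_canonical_url(html_text, url)` could supply the value for a `canonical_url` item when metadata is requested. This sketch does not check that the tag sits inside `<head>`, and a second pass over the HTML like this is exactly the re-parse cost mentioned above, so reusing the tree an extractor has already built would likely be cheaper.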