Enable extraction of canonical link information

m453h commented 2 months ago

This pull request adds the ability to extract canonical link information in mcmetadata.extract if present. This feature has been added to the extractors that use the following libraries:

trafilatura
beautifulsoup4
Goose3
newspaper
lxml

Addresses #89

m453h commented 2 months ago

I was the person who suggested we look at whether "canonical link" tags can help us identify original/preferred page URLs, but I don't know how the task was described.

Full background, since I don't know if you were given it: We have a month of HTML pages from 2021in Amazon S3 that we don't have the original URLs for, AND for ALL "historical" data (page data saved in S3) we have only the original (RSS feed) URL, and not the final (redirected) URL (this matters for page links we got from "aggregators" that published "deep links" to other sites, like news.google.com).

Another piece of background is that MOST extractions are done using trafilatura, a few are done with readability and only rarely is boilerpy3 called, and the rest, just about never.

So it's a bummer that our preferred text extraction method(s) don't give us the canonical link "for free". There had been some discussion about adding an argument to extract to force attempted extraction of the canonical link (by additional calls if needed), but again, I'm not sure that was in the scope of work you were given!

Thanks @philbudne for adding more context to this, I had seen the numerous extractors and stared to investigate the TrafilaturaExtractor class. I was unsure if other extractors would also need to return the canonical URL that's why I also made the implementation for BeautifulSoup, I was under the impression that I would need to look at the possibility of implementing the same feature for the remaining extractors (when possible to get a bit of consistency), but now I see that this is not necessary.

For the case of the TrafilaturaExtractor class which was my main focus, unless I have overlooked something... but since we are using the trafilatura.bare_extraction method, I believe that we can get the canonical URL as it is always returned by this method (if it exists).

i.e. The trafilatura.bare_extractioncalls extract_metadatawhich in turn calls the extract_url, that retrieves the canonical URL. We can access the canonical URL using the url key from the dictionary returned by the trafilatura.bare_extraction.

One limitation of this approach is that when the canonical URL is absent, the method returns the URL passed as a parameter to trafilatura.bare_extraction as the canonical URL (I believe returning None would make sense).

The alternative approach I have tested out which takes most of the ideas shared by @rahulbot here is to load the html_text using lxml and directly call extract_url but it feels too much as we would be converting the html_text to HtmlElement twice. However, since we wont be passing the default url in this method then we would get None if the canonical URL is absent.

philbudne commented 2 months ago

I think all extractors should return the data if they can.

I may have been confused, and thought trafilatura would only return the canonical link if only_with_metadata was passed (which, if we enabled it, could cause problems)

m453h commented 2 months ago

I think all extractors should return the data if they can.

I may have been confused, and thought trafilatura would only return the canonical link if only_with_metadata was passed (which, if we enabled it, could cause problems)

Sure, I will work on all the extractors that can return the canonical url....

For the case of trafilatura from the way we are calling the extraction method, I can confirm that it will return the canonical url, regardless of whether only_with_metadata is passed or not...this saves us from the pain of having to invoke any separate extraction method

The only concern I have with this approach is that if the canonical url does not exist in our document, then it will default to the original url that was passed via the extraction method. If this behaviour is fine, then we can get the canonical_url without any additional steps.

kilemensi commented 2 months ago

The only concern I have with this approach is that if the canonical url does not exist in our document, then it will default to the original url that was passed via the extraction method. If this behaviour is fine, then we can get the canonical_url without any additional steps.

This doesn't sound like an issue if indeed bare_extraction tries to extract canonical url from the document by default. Looking at the code, doesn't it require with_metadata to be explicitly set (which we're currently not doing)?

m453h commented 2 months ago

The only concern I have with this approach is that if the canonical url does not exist in our document, then it will default to the original url that was passed via the extraction method. If this behaviour is fine, then we can get the canonical_url without any additional steps.

This doesn't sound like an issue if indeed bare_extraction tries to extract canonical url from the document by default. Looking at the code, doesn't it require with_metadata to be explicitly set (which we're currently not doing)?

mmh sorry for the confusion @kilemensi , I had to confirm if we are looking at the same code, the links shared on my comment are pointing to the latest trafilatura codebase.... but as per project dependencies we are currently using v1.8.*, in which we have this part in the implementation of bare_extraction that I was using to get the canonical url with minimal cost 👇

        # extract metadata if necessary
        if output_format != 'txt':

            if not date_extraction_params:
                date_extraction_params = {
                    "extensive_search": config.getboolean('DEFAULT', 'EXTENSIVE_DATE_SEARCH'),
                }

            document = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)

We would always easily extract the url because we are using the default output format (python) that will ensure extract_metadata will always be called and thus extract the canonical_url .

Since any future bump of trafilatura will break the extraction, perhaps I should just be more explicit on how we are extracting the canonical url though it will come with some extra cost in parsing the html_text to build the HtmlElement

kilemensi commented 2 months ago

Since any future bump of trafilatura will break the extraction, perhaps I should just be more explicit on how we are extracting the canonical url though it will come with some extra cost in parsing the html_text to build the HtmlElement

All good @m453h Not sure why trafilatura released such a massive breaking change between minor version but I think this is something we need to fully discuss/document before proceeding.

mediacloud / metadata-lib

Enable extraction of canonical link information #92