Closed m453h closed 1 month ago
I was the person who suggested we look at whether "canonical link" tags can help us identify original/preferred page URLs, but I don't know how the task was described.
Full background, since I don't know if you were given it: We have a month of HTML pages from 2021in Amazon S3 that we don't have the original URLs for, AND for ALL "historical" data (page data saved in S3) we have only the original (RSS feed) URL, and not the final (redirected) URL (this matters for page links we got from "aggregators" that published "deep links" to other sites, like news.google.com).
Another piece of background is that MOST extractions are done using
trafilatura
, a few are done withreadability
and only rarely is boilerpy3 called, and the rest, just about never.So it's a bummer that our preferred text extraction method(s) don't give us the canonical link "for free". There had been some discussion about adding an argument to
extract
to force attempted extraction of the canonical link (by additional calls if needed), but again, I'm not sure that was in the scope of work you were given!
Thanks @philbudne for adding more context to this, I had seen the numerous extractors and stared to investigate the TrafilaturaExtractor
class. I was unsure if other extractors would also need to return the canonical URL that's why I also made the implementation for BeautifulSoup
, I was under the impression that I would need to look at the possibility of implementing the same feature for the remaining extractors (when possible to get a bit of consistency), but now I see that this is not necessary.
For the case of the TrafilaturaExtractor
class which was my main focus, unless I have overlooked something... but since we are using the trafilatura.bare_extraction
method, I believe that we can get the canonical URL as it is always returned by this method (if it exists).
i.e. The trafilatura.bare_extraction
calls extract_metadata
which in turn calls the extract_url
, that retrieves the canonical URL. We can access the canonical URL using the url
key from the dictionary returned by the trafilatura.bare_extraction
.
One limitation of this approach is that when the canonical URL is absent, the method returns the URL passed as a parameter to trafilatura.bare_extraction
as the canonical URL (I believe returning None
would make sense).
The alternative approach I have tested out which takes most of the ideas shared by @rahulbot here is to load the html_text
using lxml
and directly call extract_url
but it feels too much as we would be converting the html_text
to HtmlElement
twice. However, since we wont be passing the default url in this method then we would get None
if the canonical URL is absent.
I think all extractors should return the data if they can.
I may have been confused, and thought trafilatura would only return the canonical link if only_with_metadata
was passed (which, if we enabled it, could cause problems)
I think all extractors should return the data if they can.
I may have been confused, and thought trafilatura would only return the canonical link if
only_with_metadata
was passed (which, if we enabled it, could cause problems)
Sure, I will work on all the extractors that can return the canonical url....
For the case of trafilatura
from the way we are calling the extraction method, I can confirm that it will return the canonical url, regardless of whether only_with_metadata
is passed or not...this saves us from the pain of having to invoke any separate extraction method
The only concern I have with this approach is that if the canonical url does not exist in our document, then it will default to the original url that was passed via the extraction method. If this behaviour is fine, then we can get the canonical_url without any additional steps.
The only concern I have with this approach is that if the canonical url does not exist in our document, then it will default to the original url that was passed via the extraction method. If this behaviour is fine, then we can get the canonical_url without any additional steps.
This doesn't sound like an issue if
indeed bare_extraction
tries to extract canonical url from the document by default. Looking at the code, doesn't it require with_metadata
to be explicitly set (which we're currently not doing)?
The only concern I have with this approach is that if the canonical url does not exist in our document, then it will default to the original url that was passed via the extraction method. If this behaviour is fine, then we can get the canonical_url without any additional steps.
This doesn't sound like an issue
if
indeedbare_extraction
tries to extract canonical url from the document by default. Looking at the code, doesn't it requirewith_metadata
to be explicitly set (which we're currently not doing)?
mmh sorry for the confusion @kilemensi , I had to confirm if we are looking at the same code, the links shared on my comment are pointing to the latest trafilatura
codebase.... but as per project dependencies we are currently using v1.8.*, in which we have this part in the implementation of bare_extraction
that I was using to get the canonical url with minimal cost 👇
# extract metadata if necessary
if output_format != 'txt':
if not date_extraction_params:
date_extraction_params = {
"extensive_search": config.getboolean('DEFAULT', 'EXTENSIVE_DATE_SEARCH'),
}
document = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)
We would always easily extract the url because we are using the default output format (python
) that will ensure extract_metadata
will always be called and thus extract the canonical_url
.
Since any future bump of trafilatura
will break the extraction, perhaps I should just be more explicit on how we are extracting the canonical url though it will come with some extra cost in parsing the html_text
to build the HtmlElement
Since any future bump of
trafilatura
will break the extraction, perhaps I should just be more explicit on how we are extracting the canonical url though it will come with some extra cost in parsing thehtml_text
to build theHtmlElement
All good @m453h Not sure why trafilatura released such a massive breaking change between minor version but I think this is something we need to fully discuss/document before proceeding.
This pull request adds the ability to extract canonical link information in
mcmetadata.extract
if present. This feature has been added to the extractors that use the following libraries:Addresses #89