Closed vbanos closed 2 days ago
@kilemensi I know but the problem with the current code is that urls.canonical_domain(final_url)
runs every time, even if we override it.
My goal is to avoid running urls.canonical_domain(final_url)
if an override is set.
The reasons I want to avoid it are:
A) This function uses tldextract
which performs a live HTTP request to fetch the latest list of TLDs on init. We don't want that on some cases and there is no way to configure it in this package.
B) If you want to run batch processing on a large number of web pages from the same domain, you can skip this processing to speed up the operation.
Thank you!
Oh... got you @vbanos .
Looking at the code though, I don't see anywhere where the value returned by canonical_domain = urls.canonical_domain(final_url)
is used before being overridden. May be the right approach would be to remove this line of code altogether? cc @m453h / @philbudne
It is used when we make the results
dict. https://github.com/mediacloud/metadata-lib/pull/101/files#diff-0bef59bb7131dcaa95adabc9fed1341cf25d872d9ef991757709ac5378c9a590R156
Also, something else I just realised: your original comment is about canonical_url
and my MR is about canonical_domain
.
... perfecto @vbanos
Thanks for this! I'll get this released this evening
I believe what you're looking for is already done here @vbanos