mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0
12 stars 5 forks source link

Add option to override canonical_domain #101

Closed vbanos closed 2 days ago

kilemensi commented 1 week ago

I believe what you're looking for is already done here @vbanos

vbanos commented 1 week ago

@kilemensi I know but the problem with the current code is that urls.canonical_domain(final_url) runs every time, even if we override it.

My goal is to avoid running urls.canonical_domain(final_url) if an override is set.

The reasons I want to avoid it are: A) This function uses tldextract which performs a live HTTP request to fetch the latest list of TLDs on init. We don't want that on some cases and there is no way to configure it in this package.

B) If you want to run batch processing on a large number of web pages from the same domain, you can skip this processing to speed up the operation.

Thank you!

kilemensi commented 1 week ago

Oh... got you @vbanos .

Looking at the code though, I don't see anywhere where the value returned by canonical_domain = urls.canonical_domain(final_url) is used before being overridden. May be the right approach would be to remove this line of code altogether? cc @m453h / @philbudne

vbanos commented 1 week ago

It is used when we make the results dict. https://github.com/mediacloud/metadata-lib/pull/101/files#diff-0bef59bb7131dcaa95adabc9fed1341cf25d872d9ef991757709ac5378c9a590R156

Also, something else I just realised: your original comment is about canonical_url and my MR is about canonical_domain.

kilemensi commented 1 week ago

... perfecto @vbanos

pgulley commented 2 days ago

Thanks for this! I'll get this released this evening