Both rss-fetcher and story-indexer contain tests for non-news URLs based on the NON_NEWS_DOMAINS list from mcmetadata's urls.py.
rss-fetcher uses:
tasks.py: if s.domain in mcmetadata.urls.NON_NEWS_DOMAINS:
which only catches cases where the fully qualified domain name (FQDN) is EXACTLY what appears in NON_NEWS_DOMAINS, while story-indexer has a utility function that also matches anything INSIDE the embargoed domains:
def non_news_fqdn(fqdn: str) -> bool:
    """
    check if a FQDN (fully qualified domain name, i.e. DNS name)
    is (in) a domain embargoed as "non-news"
    maybe belongs in mcmetadata??
    """
    # could be written as "any" on a comprehension:
    # looks like that's 15% slower in Python 3.10,
    # and harder for me to... comprehend!
    fqdn = fqdn.lower()
    for nnd in NON_NEWS_DOMAINS:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False
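To make the difference concrete, here is a small side-by-side sketch. The NON_NEWS_DOMAINS below is a stand-in with made-up entries (not the real mcmetadata list), and exact_match / suffix_match are names I made up for the two styles of check:

```python
# Stand-in for mcmetadata.urls.NON_NEWS_DOMAINS (illustrative entries only)
NON_NEWS_DOMAINS = ["example.com", "example.org"]


def exact_match(fqdn: str) -> bool:
    # rss-fetcher style: plain membership test on the list
    return fqdn in NON_NEWS_DOMAINS


def suffix_match(fqdn: str) -> bool:
    # story-indexer style: same logic as non_news_fqdn above
    fqdn = fqdn.lower()
    for nnd in NON_NEWS_DOMAINS:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False


for name in ["example.com", "www.example.com", "notexample.com"]:
    print(name, exact_match(name), suffix_match(name))
```

Only the suffix check catches www.example.com, and prepending the "." keeps it from matching an unrelated name like notexample.com.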
I'd like to be able to use this function in rss-fetcher!
NOTE: this code assumes NON_NEWS_DOMAINS is all lower case, which is currently the case but is not enforced or guaranteed, so maybe a check for that could be added as well?
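One possible way to enforce the lower-case assumption, as a rough sketch only: check_all_lower is a hypothetical helper (nothing like it exists in mcmetadata as far as I know) that could run in a unit test or at import time, and SAMPLE is a made-up list just to show the failure:

```python
from typing import Iterable


def check_all_lower(domains: Iterable[str]) -> None:
    """Raise ValueError if any entry is not already lower case."""
    bad = [d for d in domains if d != d.lower()]
    if bad:
        raise ValueError(f"non-lower-case entries: {bad}")


# Made-up sample list with one offending entry
SAMPLE = ["example.com", "Example.org"]

try:
    check_all_lower(SAMPLE)
except ValueError as exc:
    print(exc)
```

An alternative would be to lower-case the entries when the list is built, but a check like this keeps the source list itself honest.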