mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Add function to detect non-news URLs? #91

Closed philbudne closed 1 month ago

philbudne commented 2 months ago

Both rss-fetcher and story-indexer contain tests for non-news URLs based on the NON_NEWS_DOMAINS list from mcmetadata's urls.py.

rss-fetcher uses:

tasks.py:            if s.domain in mcmetadata.urls.NON_NEWS_DOMAINS:

which only catches cases where the fully qualified domain name (FQDN) is EXACTLY what appears in NON_NEWS_DOMAINS, while story-indexer has a utility function that also matches any subdomain of the embargoed domains:

def non_news_fqdn(fqdn: str) -> bool:
    """
    Check if a FQDN (fully qualified domain name, i.e. DNS name)
    is (in) a domain embargoed as "non-news".

    maybe belongs in mcmetadata??
    """
    # could be written as "any" on a comprehension:
    # looks like that's 15% slower in Python 3.10,
    # and harder for me to... comprehend!
    fqdn = fqdn.lower()
    for nnd in NON_NEWS_DOMAINS:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False

I'd like to be able to use this function in rss-fetcher!
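
To make the difference between the two checks concrete, here is a minimal sketch that runs both against the same names. The `NON_NEWS_DOMAINS` list below is a hypothetical stand-in with placeholder entries, not the real list from mcmetadata, and `exact_match` is just an illustrative name for the rss-fetcher style membership test.

```python
# Hypothetical stand-in for mcmetadata.urls.NON_NEWS_DOMAINS;
# the entries are placeholders, not taken from the real list.
NON_NEWS_DOMAINS = ["example.com", "example.org"]


def exact_match(fqdn: str) -> bool:
    """rss-fetcher style check: plain list membership."""
    return fqdn in NON_NEWS_DOMAINS


def non_news_fqdn(fqdn: str) -> bool:
    """story-indexer style check: the domain itself, or any name under it."""
    fqdn = fqdn.lower()
    for nnd in NON_NEWS_DOMAINS:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False


# Print both results side by side for a few sample names.
for name in ("example.com", "news.example.com", "notexample.com"):
    print(name, exact_match(name), non_news_fqdn(name))
```

Only the suffix check flags `news.example.com`. Prepending the `"."` before calling `endswith` is what keeps `notexample.com` from matching by accident.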

NOTE: this code assumes NON_NEWS_DOMAINS is all lower case, which is currently... the case, but that is not enforced/guaranteed, so maybe that could be added as well?!
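
One possible way to enforce the lower-case assumption, sketched under assumptions: `RAW_DOMAINS` is a hypothetical stand-in for the list (with one mixed-case entry added on purpose), and `find_non_lowercase` and `NORMALIZED_DOMAINS` are illustrative names, not anything that exists in mcmetadata today.

```python
# Hypothetical stand-in for NON_NEWS_DOMAINS, with one mixed-case
# entry included deliberately to show what the check catches.
RAW_DOMAINS = ["example.com", "Example.ORG"]


def find_non_lowercase(domains):
    """Return the entries that are not already lower case."""
    return [d for d in domains if d != d.lower()]


# Option 1: a check that could live in a unit test, so a mixed-case
# entry fails CI instead of silently never matching.
offenders = find_non_lowercase(RAW_DOMAINS)
print("not lower case:", offenders)

# Option 2: normalize once at import time, so lookups never depend
# on how an entry happened to be typed.
NORMALIZED_DOMAINS = frozenset(d.lower() for d in RAW_DOMAINS)
print(sorted(NORMALIZED_DOMAINS))
```

The two options are not mutually exclusive: the test keeps the source list tidy, while normalizing makes the matching function safe even if an entry slips through.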