Open philbudne opened 3 months ago
rss-fetcher output includes URLs that story-indexer regards as "non-news", both simple domain names (archive.org) and subdomains (xyz.iheart.com):
2024-08-16 18:17:28,180 c9a6a33e93c1 rss-puller INFO: non-news: http://archive.org/details/dlibra.bibliotekaelblaska.pl.92649-2.30732645
2024-08-16 18:17:26,732 c9a6a33e93c1 rss-puller INFO: non-news: https://kentuckynewsnetwork.iheart.com/content/2024-08-16-18-year-old-teen-cowboy-ace-patton-ashford-killed-in-freak-accident/
2024-08-16 18:17:24,066 c9a6a33e93c1 rss-puller INFO: non-news: https://knrs.iheart.com/content/2024-08-16-new-poll-shows-where-harris-trump-stand-in-crucial-swing-state/
2024-08-16 18:17:23,563 c9a6a33e93c1 rss-puller INFO: non-news: https://buckeyecountry105.iheart.com/content/2024-08-16-new-poll-shows-where-harris-trump-stand-in-crucial-swing-state/
2024-08-16 18:17:19,856 c9a6a33e93c1 rss-puller INFO: non-news: https://wgy.iheart.com/content/2024-08-16-boebert-bikini-photo-supporting-colleague-reveals-massive-secret-tattoo/
story-indexer has a non_news_fqdn function for this. https://github.com/mediacloud/metadata-lib/issues/91 is a request to move that function to mc_metadata.
Code is in https://github.com/mediacloud/metadata-lib/pull/93
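For illustration only, here is a rough sketch of the kind of check rss-fetcher could apply before emitting a URL. It is not the story-indexer or metadata-lib code: the names is_non_news_fqdn, is_non_news_url and NON_NEWS_DOMAINS are hypothetical, and the domain list holds only the two domains seen in the log above.

```python
from urllib.parse import urlsplit

# Hypothetical list: only the two domains mentioned in this issue.
# The real list lives in story-indexer / metadata-lib and is not reproduced here.
NON_NEWS_DOMAINS = {"archive.org", "iheart.com"}


def is_non_news_fqdn(fqdn: str) -> bool:
    """Return True if fqdn is a listed domain or a subdomain of one."""
    fqdn = fqdn.lower().rstrip(".")
    return any(
        fqdn == domain or fqdn.endswith("." + domain)
        for domain in NON_NEWS_DOMAINS
    )


def is_non_news_url(url: str) -> bool:
    """Extract the host from url and apply the FQDN check."""
    host = urlsplit(url).hostname or ""
    return is_non_news_fqdn(host)
```

Matching on "." + domain rather than a bare suffix keeps an unrelated host such as notiheart.com from being dropped while still catching knrs.iheart.com.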