mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Investigate rss-fetcher returning non-news URLs #44

Open philbudne opened 3 months ago

philbudne commented 3 months ago

rss-fetcher output includes URLs that story-indexer regards as "non-news", both simple domain names (archive.org) and subdomains (xyz.iheart.com):

2024-08-16 18:17:28,180 c9a6a33e93c1 rss-puller INFO: non-news: http://archive.org/details/dlibra.bibliotekaelblaska.pl.92649-2.30732645
2024-08-16 18:17:26,732 c9a6a33e93c1 rss-puller INFO: non-news: https://kentuckynewsnetwork.iheart.com/content/2024-08-16-18-year-old-teen-cowboy-ace-patton-ashford-killed-in-freak-accident/
2024-08-16 18:17:24,066 c9a6a33e93c1 rss-puller INFO: non-news: https://knrs.iheart.com/content/2024-08-16-new-poll-shows-where-harris-trump-stand-in-crucial-swing-state/
2024-08-16 18:17:23,563 c9a6a33e93c1 rss-puller INFO: non-news: https://buckeyecountry105.iheart.com/content/2024-08-16-new-poll-shows-where-harris-trump-stand-in-crucial-swing-state/
2024-08-16 18:17:19,856 c9a6a33e93c1 rss-puller INFO: non-news: https://wgy.iheart.com/content/2024-08-16-boebert-bikini-photo-supporting-colleague-reveals-massive-secret-tattoo/

philbudne commented 1 month ago

story-indexer has a non_news_fqdn function for this. https://github.com/mediacloud/metadata-lib/issues/91 is a request to move that function to mc_metadata.
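
For illustration only, a minimal sketch of the kind of check involved, which might be applied in rss-fetcher before URLs are emitted. This is not the actual non_news_fqdn code from story-indexer: the names is_non_news_fqdn, is_non_news_url and NON_NEWS_DOMAINS are hypothetical, and the two listed domains are simply the ones seen in the log lines above.

```python
from urllib.parse import urlsplit

# Hypothetical list for illustration; the real list is maintained in story-indexer.
NON_NEWS_DOMAINS = {"archive.org", "iheart.com"}


def is_non_news_fqdn(fqdn: str) -> bool:
    """Return True if fqdn is a listed domain or a subdomain of one."""
    fqdn = fqdn.lower().rstrip(".")
    # Compare on a label boundary so "notiheart.com" does not match "iheart.com".
    return any(fqdn == d or fqdn.endswith("." + d) for d in NON_NEWS_DOMAINS)


def is_non_news_url(url: str) -> bool:
    """Extract the host from url and test it; URLs without a host are not flagged."""
    host = urlsplit(url).hostname
    return host is not None and is_non_news_fqdn(host)
```

Matching on the whole name or a ".domain" suffix covers both cases reported here: the simple domain name (archive.org) and the per-station subdomains (xyz.iheart.com).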

philbudne commented 1 month ago

Code is in https://github.com/mediacloud/metadata-lib/pull/93