mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

rss-fetcher RSS files contain URLs for non-text files #31

Open philbudne opened 6 months ago

philbudne commented 6 months ago
(venv) pbudne@ifill:~/done$ ls mc*
mc-2023-12-02.rss.gz  mc-2023-12-21.rss.gz  mc-2023-12-25.rss.gz

(venv) pbudne@ifill:~/done$ zegrep -i '<link>.*\.(jpg|gif|jpeg|mp[34]|mpg|pdf).*</link>' mc* | head -10
mc-2023-12-02.rss.gz:<item><link>https://www.corbettreport.com/mp4/nwnw536.mp4</link><pubDate>Fri, 01 Dec 2023 03:13:00 -0000</pubDate><domain>corbettreport.com</domain><title>Right Wing Rising? - #NewWorldNextWeek</title></item>
mc-2023-12-02.rss.gz:<item><link>https://www.corbettreport.com/mp4/nwnw528.mp4</link><pubDate>Fri, 01 Sep 2023 07:03:00 -0000</pubDate><domain>corbettreport.com</domain><title>Back-to-School Adderall Shortage - #NewWorldNextWeek</title></item>
mc-2023-12-02.rss.gz:<item><link>https://www.corbettreport.com/mp4/moric-geopoliticsandempire.mp4</link><pubDate>Tue, 29 Aug 2023 06:03:00 -0000</pubDate><domain>corbettreport.com</domain><title>Geopolitics and Empire with Hrvoje Morić</title></item>
mc-2023-12-02.rss.gz:<item><link>https://www.corbettreport.com/mp4/flashback-tedx.mp4</link><pubDate>Sat, 19 Aug 2023 09:03:00 -0000</pubDate><domain>corbettreport.com</domain><title>See James Corbett's Censored TedX Talk! (2014)</title></item>
mc-2023-12-02.rss.gz:<item><link>https://wiki.d-addicts.com/index.php?title=File:My_Naughty_Assistant.jpg&amp;diff=904812&amp;oldid=0</link><pubDate></pubDate><domain>d-addicts.com</domain><title>File:My Naughty Assistant.jpg</title></item>
mc-2023-12-02.rss.gz:<item><link>https://wiki.d-addicts.com/File:Amazing_Girls.jpg</link><pubDate></pubDate><domain>d-addicts.com</domain><title>File:Amazing Girls.jpg</title></item>
mc-2023-12-02.rss.gz:<item><link>https://www.gifu-np.co.jp/articles/-/320468</link><pubDate>Sat, 02 Dec 2023 00:14:00 -0000</pubDate><domain>gifu-np.co.jp</domain><title>今年のコトバ「迎和」違う立場でも話し合って解決を、住職が揮毫 岐阜・笠松町の福證寺 (岐阜新聞)</title></item>
mc-2023-12-02.rss.gz:<item><link>https://muppet.fandom.com/wiki/File:2023_Sesame_Street_Market_baby_hat_CM.jpg?diff=1616712&amp;oldid=0</link><pubDate></pubDate><domain>fandom.com</domain><title>File:2023 Sesame Street Market baby hat CM.jpg</title></item>
mc-2023-12-02.rss.gz:<item><link>https://muppet.fandom.com/wiki/File:2023_Sesame_Street_Market_baby_hat_2a.jpg?diff=1616711&amp;oldid=0</link><pubDate></pubDate><domain>fandom.com</domain><title>File:2023 Sesame Street Market baby hat 2a.jpg</title></item>
mc-2023-12-02.rss.gz:<item><link>https://muppet.fandom.com/wiki/File:2023_Sesame_Street_Market_baby_hat_1a.jpg?diff=1616710&amp;oldid=0</link><pubDate></pubDate><domain>fandom.com</domain><title>File:2023 Sesame Street Market baby hat 1a.jpg</title></item>

(venv) pbudne@ifill:~/done$ zegrep -i '<link>.*\.(jpg|gif|jpeg|mp[34]|mpg|pdf).*</link>' mc* | wc -l
1312
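
Note that the pattern above over-matches: the gifu-np.co.jp line in the output is an ordinary article URL, caught only because `\.gif` matches inside the hostname `www.gifu-np.co.jp`, so the count of 1312 likely includes some false positives. A stricter check could parse each URL and test only the extension of its path. The following is a minimal sketch under that assumption, not code from rss-fetcher; the function name `looks_like_non_text` and the constant `NON_TEXT_EXTENSIONS` are hypothetical, and the extension list is simply copied from the egrep pattern.

```python
from posixpath import splitext
from urllib.parse import urlsplit

# Assumed extension list, copied from the egrep pattern above.
NON_TEXT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".mp3", ".mp4", ".mpg", ".pdf"}


def looks_like_non_text(url: str) -> bool:
    """Return True if the URL's path (not its host or query) ends in a listed extension.

    Hypothetical helper for illustration only.
    """
    path = urlsplit(url).path      # drops scheme, host, query string and fragment
    _, ext = splitext(path)        # extension of the last path component, or ""
    return ext.lower() in NON_TEXT_EXTENSIONS


if __name__ == "__main__":
    for url in (
        "https://www.corbettreport.com/mp4/nwnw536.mp4",
        "https://www.gifu-np.co.jp/articles/-/320468",
        "https://muppet.fandom.com/wiki/File:2023_Sesame_Street_Market_baby_hat_CM.jpg?diff=1616712&oldid=0",
    ):
        print(looks_like_non_text(url), url)
```

With this check the gifu-np.co.jp article is no longer flagged. The wiki `File:` URLs whose path ends in `.jpg` are still flagged, even though they may be HTML description pages rather than images; whether to exclude those is a separate decision.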