mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

investigate why npr.org stories aren't making it into database #34

Closed rahulbot closed 4 months ago

rahulbot commented 4 months ago

From a report relayed via @Evan-Leon: npr.org has valid RSS feeds, but only 5 stories match that domain in the index. Preliminary investigation shared on Slack pointed to the rss-fetcher as a potential point where they disappear, so I'm creating an issue to track notes here.

rahulbot commented 4 months ago

I found News Feed #1992002 and picked a story to trace:

I then downloaded the synthetic RSS file mc-2024-02-03.rss and found the story only at line 294493, from a different domain (NPR syndicates):

<item>
    <link>https://www.kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clark</link>
    <pubDate>Sat, 03 Feb 2024 20:31:58 -0000</pubDate>
    <domain>kios.org</domain>
    <title>Why there's a basketball fan frenzy over Iowa's Caitlin Clark</title>
</item>
rahulbot commented 4 months ago

@philbudne found instances of that story in the live rss-fetcher DB, including one from npr.org:

rss_fetcher=# select * from stories where title like '%basketball fan frenzy over Iowa%';
-[ RECORD 1 ]---------+--------------------------------------------------------------------------------------------
id                    | 543540153
feed_id               | 1992002
url                   | https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy
guid                  | https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy
published_at          | 2024-02-03 20:31:58
fetched_at            | 2024-02-03 21:42:25.545889
domain                | npr.org
title                 | Why there's a basketball fan frenzy over Iowa's Caitlin Clark
normalized_url        | http://npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy
normalized_title      | why there's a basketball fan frenzy over iowa's caitlin clark
normalized_title_hash | 923f6ed6cc7fd5f5ef1a3da119d92d7e
sources_id            | 1096
-[ RECORD 2 ]---------+--------------------------------------------------------------------------------------------
id                    | 543515710
feed_id               | 2404370
url                   | https://www.kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clark
guid                  | https://www.kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clark
published_at          | 2024-02-03 20:31:58
fetched_at            | 2024-02-03 20:46:25.071162
domain                | kios.org
title                 | Why there's a basketball fan frenzy over Iowa's Caitlin Clark
normalized_url        | http://kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clark
normalized_title      | why there's a basketball fan frenzy over iowa's caitlin clark
normalized_title_hash | 923f6ed6cc7fd5f5ef1a3da119d92d7e
sources_id            | 306237
-[ RECORD 3 ]---------+--------------------------------------------------------------------------------------------
id                    | 543552971
feed_id               | 882355
url                   | https://kjzz.org/content/1870234/why-theres-basketball-fan-frenzy-over-iowas-caitlin-clark
guid                  | 1870234 at https://kjzz.org
published_at          | 2024-02-03 20:31:00
fetched_at            | 2024-02-03 22:20:12.788298
domain                | kjzz.org
title                 | Why there's a basketball fan frenzy over Iowa's Caitlin Clark
normalized_url        | http://kjzz.org/content/1870234/why-theres-basketball-fan-frenzy-over-iowas-caitlin-clark
normalized_title      | why there's a basketball fan frenzy over iowa's caitlin clark
normalized_title_hash | 923f6ed6cc7fd5f5ef1a3da119d92d7e
sources_id            | 199700
-[ RECORD 4 ]---------+--------------------------------------------------------------------------------------------
id                    | 543532661
feed_id               | 2328243
url                   | https://upstract.com/x/1e1aefc8fd48fa7f?ref=rss
guid                  | https://upstract.com/x/1e1aefc8fd48fa7f?ref=rss
published_at          | 2024-02-03 20:44:17
fetched_at            | 2024-02-03 21:26:26.358641
domain                | upstract.com
title                 | Why there's a basketball fan frenzy over Iowa's Caitlin Clark
normalized_url        | http://upstract.com/x/1e1aefc8fd48fa7f
normalized_title      | why there's a basketball fan frenzy over iowa's caitlin clark
normalized_title_hash | 923f6ed6cc7fd5f5ef1a3da119d92d7e
sources_id            | 623040
rahulbot commented 4 months ago

I see 38 stories from npr.org in that day's generated RSS file with that pub date on their URL, so some from that domain made it in (but I don't know which sources/feed they were from). I picked one ("Opinion: Their deaths leave holes that will never be filled") and couldn't find it in the ES index, but did find the syndicated version from kpbs.org in there (story id ae6df224cc7de8692663d76584627388f28de907c4d834e22e03de696a2a32d9).

<item><link>https://www.npr.org/2024/02/03/1196550737/best-of-car-talk-draft-02-03-2024</link><pubDate>Sat, 03 Feb 2024 08:00:52 -0000</pubDate><domain>npr.org</domain><title>#2410: The Costs of Free Help</title></item>
<item><link>https://www.npr.org/2024/02/03/1197962352/faw-emmastone-lsd</link><pubDate>Sat, 03 Feb 2024 08:00:00 -0000</pubDate><domain>npr.org</domain><title>Best Of: Emma Stone / The Birth Of Psychedelic Science</title></item>
<item><link>https://www.npr.org/2024/02/03/1228837009/fire-kenya-nairobi-gas-explosion</link><pubDate></pubDate><domain>npr.org</domain><title>A fire set off by a gas explosion in Kenya kills at least 3 people and injures 280</title></item>
<item><link>https://www.npr.org/2024/02/03/1228592039/biden-south-carolina-primary-2024-election</link><pubDate></pubDate><domain>npr.org</domain><title>South Carolina Democrats hold their primary today. Here’s why it matters</title></item>
<item><link>https://www.npr.org/2024/02/03/1226456554/el-salvador-election-primer-bukele</link><pubDate>Sat, 03 Feb 2024 12:19:00 -0000</pubDate><domain>npr.org</domain><title>El Salvador is poised to reelect its popular but authoritarian president</title></item>
<item><link>https://www.npr.org/2024/02/03/1227587116/guns-historians-rights-control-second-amendment-supreme-court</link><pubDate>Sat, 03 Feb 2024 12:29:03 -0000</pubDate><domain>npr.org</domain><title>In today’s gun rights cases, historians are in hot demand. Here’s why</title></item>
<item><link>https://www.npr.org/2024/02/03/1228843197/flaco-owl-central-park-zoo-escape-mystery</link><pubDate>Sat, 03 Feb 2024 16:43:13 -0000</pubDate><domain>npr.org</domain><title>A year later, Flaco the owl’s escape from the Central Park Zoo remains a mystery</title></item>
<item><link>https://www.npr.org/2024/02/03/1198908478/wait-wait-dont-tell-me-draft-02-03-2024</link><pubDate>Sat, 03 Feb 2024 13:46:54 -0000</pubDate><domain>npr.org</domain><title>WWDTM: Kristen Kish</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839437/a-year-after-the-ohio-train-derailment-experts-still-worry-about-toxins-it-relea</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>A year after the Ohio train derailment, experts still worry about toxins it released</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839430/texas-national-guard-takes-over-city-park-blocks-federal-agents-from-operating-t</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Texas National Guard takes over city park, blocks federal agents from operating there</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839423/former-u-s-ambassador-to-lebanon-on-the-significance-of-the-retaliatory-strikes</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Former U.S. ambassador to Lebanon on the significance of the retaliatory strikes</title></item>
<item><link>https://www.npr.org/2024/02/03/1228829895/california-atmospheric-river-floods-forecast</link><pubDate>Sat, 03 Feb 2024 13:38:57 -0000</pubDate><domain>npr.org</domain><title>Atmospheric river expected to bring life-threatening floods to Southern California</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839402/sara-hill-becomes-first-indigenous-woman-to-serve-on-federal-bench-in-oklahoma</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>Sara Hill becomes first Indigenous woman to serve on federal bench in Oklahoma</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839374/catching-up-on-the-latest-development-in-trumps-trial-with-da-fani-willis</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>Catching up on the latest development in Trump's trial with DA Fani Willis</title></item>
<item><link>https://www.npr.org/2024/02/03/1228765431/opinion-their-deaths-leave-holes-that-will-never-be-filled</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>Opinion: Their deaths leave holes that will never be filled</title></item>
<item><link>https://www.npr.org/2024/02/03/1228844747/northern-ireland-sinn-fein-michelle-oneill-government</link><pubDate>Sat, 03 Feb 2024 15:12:47 -0000</pubDate><domain>npr.org</domain><title>For the first time, an Irish nationalist leads Northern Ireland’s government</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839416/the-u-s-hit-over-85-iran-linked-targets-in-retaliatory-strikes</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>The U.S. hit over 85 Iran-linked targets in retaliatory strikes</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839444/saturday-sports-nfl-and-gambling-nhl-allows-players-to-compete-in-the-olympics</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Saturday Sports: NFL and gambling, NHL allows players to compete in the Olympics</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839451/how-one-neighborhood-in-colombia-is-tackling-climate-change-at-the-community-lev</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>How one neighborhood in Colombia is tackling climate change at the community level</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839458/how-one-maryland-phone-box-turned-into-a-work-of-art-connecting-people-to-nature</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>How one Maryland phone box turned into a work of art connecting people to nature</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839465/c-l-miller-on-her-debut-mystery-novel-and-growing-up-in-the-antiques-business</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>C.L. Miller on her debut mystery novel and growing up in the antiques business</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839423/former-u-s-ambassador-to-lebanon-on-the-significance-of-the-retaliatory-strikes?ft=nprml&amp;f=</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Former U.S. ambassador to Lebanon on the significance of the retaliatory strikes</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839430/texas-national-guard-takes-over-city-park-blocks-federal-agents-from-operating-t?ft=nprml&amp;f=</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Texas National Guard takes over city park, blocks federal agents from operating there</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839437/a-year-after-the-ohio-train-derailment-experts-still-worry-about-toxins-it-relea?ft=nprml&amp;f=</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>A year after the Ohio train derailment, experts still worry about toxins it released</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839381/with-the-end-of-his-presidential-run-will-desantis-retreat-from-the-culture-wars</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>With the end of his presidential run, will DeSantis retreat from the culture wars?</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839388/1-year-later-turkeys-earthquake-victims-live-in-tents-awaiting-permanent-housing</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>1 year later, Turkey's earthquake victims live in tents, awaiting permanent housing</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839395/a-business-owner-reflects-on-the-state-of-the-economy-amid-its-soft-landing</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>A business owner reflects on the state of the economy amid its 'soft landing'</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839409/the-promised-land-is-a-western-that-follows-a-retired-danish-officer-in-1755</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>'The Promised Land' is a western that follows a retired Danish officer in 1755</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839360/u-s-conducts-strikes-in-iraq-and-syria-in-response-to-the-killing-of-3-service-m</link><pubDate>Sat, 03 Feb 2024 13:00:21 -0000</pubDate><domain>npr.org</domain><title>U.S. conducts strikes in Iraq and Syria in response to the killing of 3 service members</title></item>
<item><link>https://www.npr.org/2024/02/03/1228857108/us-strikes-iran-proxies-houthis-yemen</link><pubDate>Sat, 03 Feb 2024 22:21:55 -0000</pubDate><domain>npr.org</domain><title>The U.S. targets Iranian proxies for a second day in a row</title></item>
<item><link>https://www.npr.org/2024/02/03/1228140875/whats-making-us-happy-a-guide-to-your-weekend-viewing-and-listening</link><pubDate>Sat, 03 Feb 2024 12:01:02 -0000</pubDate><domain>npr.org</domain><title>What's Making Us Happy: A guide to your weekend viewing and listening</title></item>
<item><link>https://www.npr.org/2024/02/03/1228857501/iraq-condemns-us-airstrikes-retaliation</link><pubDate>Sat, 03 Feb 2024 18:53:00 -0000</pubDate><domain>npr.org</domain><title>Iraq condemns U.S. airstrikes against Iran-linked groups</title></item>
<item><link>https://www.npr.org/2024/02/03/1228722592/wait-wait-for-february-3-2024-live-from-milwaukee-with-kristen-kish</link><pubDate>Sat, 03 Feb 2024 13:53:07 -0000</pubDate><domain>npr.org</domain><title>'Wait Wait' for February 3, 2024: Live from Milwaukee with Kristen Kish!</title></item>
<item><link>https://www.npr.org/2024/02/03/1228392389/poor-things-emma-stone-benjamin-breen-tripping-on-utopia</link><pubDate>Sat, 03 Feb 2024 10:01:02 -0000</pubDate><domain>npr.org</domain><title>Fresh Air Weekend: Emma Stone; Margaret Mead's influence on the psychedelic era</title></item>
<item><link>https://www.npr.org/2024/02/03/1198910483/up-first-draft-02-03-2024</link><pubDate>Sat, 03 Feb 2024 17:20:37 -0000</pubDate><domain>npr.org</domain><title>U.S. Air Strikes in Middle East, Tech Testimony, Texas Border Dispute</title></item>
<item><link>https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy</link><pubDate>Sat, 03 Feb 2024 20:31:58 -0000</pubDate><domain>npr.org</domain><title>Why there's a basketball fan frenzy over Iowa's Caitlin Clark</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839367/week-in-politics-biden-in-a-precarious-place-as-he-runs-for-reelection</link><pubDate>Sat, 03 Feb 2024 13:00:22 -0000</pubDate><domain>npr.org</domain><title>Week in politics: Biden in a precarious place as he runs for reelection</title></item>
<item><link>https://www.npr.org/2024/02/03/1227566757/south-carolina-democratic-primary-election-results-2024</link><pubDate>Sat, 03 Feb 2024 05:01:31 -0000</pubDate><domain>npr.org</domain><title>Here are South Carolina’s 2024 Democratic presidential primary results</title></item>
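A tally like the one above can be reproduced with a short script. This is only a minimal sketch, assuming the generated file is a flat run of `<item>` elements with `<link>` and `<domain>` children as in the listing; the function name `summarize_items` is hypothetical and not part of rss-fetcher. It also flags links that differ only by query string, since a few entries in the listing repeat with `?ft=nprml&f=` appended.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def summarize_items(rss_fragment):
    """Count <item> entries per <domain>; list links repeated modulo query string."""
    # Wrap the fragment so a bare run of <item> elements parses as one document.
    root = ET.fromstring("<items>" + rss_fragment + "</items>")
    per_domain = Counter()
    per_link = Counter()
    for item in root.iter("item"):
        per_domain[item.findtext("domain", default="")] += 1
        parts = urlsplit(item.findtext("link", default=""))
        # Drop query and fragment so "?ft=nprml&f=" variants collapse together.
        per_link[urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))] += 1
    repeated = sorted(url for url, n in per_link.items() if n > 1)
    return per_domain, repeated


# Three entries copied from this thread (two npr.org variants, one kios.org).
SAMPLE = """
<item><link>https://www.npr.org/2024/02/03/1228839423/former-u-s-ambassador-to-lebanon-on-the-significance-of-the-retaliatory-strikes</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Former U.S. ambassador to Lebanon on the significance of the retaliatory strikes</title></item>
<item><link>https://www.npr.org/2024/02/03/1228839423/former-u-s-ambassador-to-lebanon-on-the-significance-of-the-retaliatory-strikes?ft=nprml&amp;f=</link><pubDate>Sat, 03 Feb 2024 13:59:32 -0000</pubDate><domain>npr.org</domain><title>Former U.S. ambassador to Lebanon on the significance of the retaliatory strikes</title></item>
<item><link>https://www.kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clark</link><pubDate>Sat, 03 Feb 2024 20:31:58 -0000</pubDate><domain>kios.org</domain><title>Why there's a basketball fan frenzy over Iowa's Caitlin Clark</title></item>
"""

per_domain, repeated = summarize_items(SAMPLE)
print(per_domain)
print(repeated)
```

For the full daily file, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be a better fit than loading everything at once.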
philbudne commented 4 months ago

In the case of the Caitlin Clark article, it looks like the fetch failed 3x:

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ zgrep www.npr.org/2024/02/03/1228858826/caitlin-clark messages.log.2024-02-05*
messages.log.2024-02-05_21:2024-02-05 21:42:36,597 a1e8290819c4 fetcher DEBUG: Retrying <GET https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy> (failed 1 times): User timeout caused connection failure: Getting https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy took longer than 60.0 seconds..
messages.log.2024-02-05_21:2024-02-05 21:49:44,151 a1e8290819c4 fetcher DEBUG: Retrying <GET https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy> (failed 2 times): User timeout caused connection failure: Getting https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy took longer than 60.0 seconds..
messages.log.2024-02-05_21:2024-02-05 21:57:36,659 a1e8290819c4 fetcher ERROR: Gave up retrying <GET https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy> (failed 3 times): User timeout caused connection failure: Getting https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy took longer than 60.0 seconds..

My feeling has increasingly been that scrapers and archivers don't mind that their results are "best effort" for low values of "best", since they'll eventually crawl the world over again another day.

My understanding was that scrapy retries failed URLs once the queue of untried ones has emptied (this may be the "long tail" we see). My queue-based fetcher will retry 12x at one-hour intervals, which I'm concerned would STILL not be trying hard enough if the target site is unreachable for a day. 24-hour downtime might be unheard of for a big commercial site, where downtime is lost money; less so for sites that are not making big money.

In both the batch (scrapy) and current queue-based fetchers, after all fetch attempts fail, the story is dropped (but at least the queue-based fetcher's "retryx" counter will give an indication of volume).

Even with the queue-based fetcher, if UMass were to lose its network connectivity for the full span of the 12 retries, we would have to reload the queue with the previous day/file of RSS entries....
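The arithmetic behind that concern can be sketched as follows. This is an illustration only: it assumes the first attempt fails at the start of the outage and that retry k runs exactly k intervals later, which may not match how the real fetcher schedules retries. The function names are hypothetical.

```python
import math
from datetime import timedelta


def retries_cover(outage, retries=12, interval=timedelta(hours=1)):
    """True if the last retry falls at or after the end of the outage."""
    # Assumption: the first attempt fails at t=0 and retry k runs at k * interval.
    return retries * interval >= outage


def retries_needed(outage, interval=timedelta(hours=1)):
    """Smallest retry count whose last attempt reaches the end of the outage."""
    return math.ceil(outage / interval)


print(retries_cover(timedelta(hours=6)))    # outage shorter than the retry window
print(retries_cover(timedelta(hours=24)))   # day-long outage
print(retries_needed(timedelta(hours=24)))  # retries required at one-hour spacing
```

Under these assumptions, 12 hourly retries span 12 hours, so a day-long outage would need either about twice as many retries or a longer interval.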

philbudne commented 4 months ago

P.S. mea culpa for not thinking of grep'ing the log files before.... I've been working on getting the (original) URL logged every step of the way to better be able to trace where and when things went awry for EXACTLY this sort of auditing!

philbudne commented 4 months ago

Casting a wider net (over three days):

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ zgrep 'Gave up retrying <GET https://www.npr.org/' messages.log.2024-02-0[4-6]* | wc -l
1220

About 12% of the total "Gave up" messages in the same period:

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ zgrep 'Gave up retrying <GET' messages.log.2024-02-0[4-6]* | wc -l
10330
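For a per-host breakdown rather than two separate greps, something like the following could work. It is a sketch that assumes the log line shape shown earlier in this thread ("Gave up retrying <GET URL> ..."); `gave_up_by_host` is a hypothetical helper, and the two sample lines are copied from the log excerpt above.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches only the final "Gave up" message, not the intermediate "Retrying" ones.
GAVE_UP = re.compile(r"Gave up retrying <GET ([^>\s]+)>")


def gave_up_by_host(lines):
    """Count 'Gave up retrying' log messages per URL hostname."""
    counts = Counter()
    for line in lines:
        match = GAVE_UP.search(line)
        if match:
            counts[urlsplit(match.group(1)).hostname] += 1
    return counts


SAMPLE_LINES = [
    "2024-02-05 21:49:44,151 a1e8290819c4 fetcher DEBUG: Retrying <GET https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy> (failed 2 times): User timeout caused connection failure",
    "2024-02-05 21:57:36,659 a1e8290819c4 fetcher ERROR: Gave up retrying <GET https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy> (failed 3 times): User timeout caused connection failure",
]

counts = gave_up_by_host(SAMPLE_LINES)
print(counts.most_common(10))
```

On the real logs the lines could be fed from `open(...)` or `gzip.open(..., "rt")`, depending on whether a given file is compressed. For reference, 1220 out of 10330 is roughly 11.8%.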
philbudne commented 4 months ago

Two week count:

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ zgrep -a 'Gave up retrying <GET' messages.log.* | wc -l
58013
pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ ls -lt | tail
-rw-r--r-- 1 root angwin 146550466 Jan 27 19:23 messages.log.2024-01-27_23
rahulbot commented 4 months ago

So perhaps one hypothesis is that npr.org is rejecting our USER_AGENT when we try to fetch the webpage. @NullPxl can you please run a quick python notebook test that tries to fetch the URL in question with various USER_AGENT strings? The question is whether npr.org is within the set of domains that reject custom UA strings.
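A notebook test along those lines might look like the sketch below. It uses only the standard library; `probe_user_agents` and its `opener` parameter are hypothetical names introduced here (the parameter exists so the function can be exercised without network access). Note that `urllib` speaks HTTP/1.1, so a rejection may surface differently than it does in a client that negotiates HTTP/2.

```python
import urllib.error
import urllib.request


def probe_user_agents(url, user_agents, timeout=15, opener=urllib.request.urlopen):
    """Fetch url once per User-Agent string and record the outcome of each try."""
    results = {}
    for ua in user_agents:
        request = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            with opener(request, timeout=timeout) as response:
                results[ua] = f"HTTP {response.status}"
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status.
            results[ua] = f"HTTP {exc.code}"
        except OSError as exc:
            # Covers timeouts, connection resets, and urllib.error.URLError.
            results[ua] = f"{type(exc).__name__}: {exc}"
    return results


# Example call (performs real network requests, so it is left commented out):
# probe_user_agents(
#     "https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy",
#     ["mediacloud bot for open academic research (mediacloud dot org)",
#      "mediacloud not for open academic research (mediacloud dot org)"],
# )
```

Comparing the result strings across UA variants shows whether a failure follows the UA or the URL.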

philbudne commented 4 months ago

Knock me over with a feather, I wouldn't have guessed a UA string would result in "connection error"!! At least with curl, the connection is HTTP/2, which I know little about. In some quick poking, here is a string that's rejected:

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ curl -q -H "User-Agent: mediacloud bot for open academic research (mediacloud dot org)"  https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

and a one-character change(!!) from that, which is accepted:

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ curl -q -H "User-Agent: mediacloud not for open academic research (mediacloud dot org)"  https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 82150    0 82150    0     0   508k      0 --:--:-- --:--:-- --:--:--  507k

But that one character change alone is insufficient!

pbudne@ramos:/srv/data/docker/indexer/worker_data/logs$ curl -q -H "User-Agent: mediacloud not for open academic research (+https://mediacloud.org)"  https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
philbudne commented 4 months ago

Here's a look at npr.org feeds:

rss_fetcher=# select url, last_new_stories, system_status from feeds where url like '%npr.org/%';
                          url                           |      last_new_stories      |   system_status    
--------------------------------------------------------+----------------------------+--------------------
 https://www.npr.org/rss/podcast.php?id=510289          |                            | read timeout
 http://www.npr.org/rss/rss.php?id=1004                 | 2023-01-12 11:30:01.403775 | read timeout
 https://feeds.npr.org/1001/rss.xml                     | 2024-02-10 19:47:20.539661 | Working
 http://www.npr.org/rss/rss.php?id=3                    | 2023-01-12 12:11:22.199187 | read timeout
 https://www.npr.org/rss/rss.php?id=1131                |                            | read timeout
 http://www.npr.org/rss/rss.php?id=1014                 | 2023-01-10 17:57:05.146873 | read timeout
 https://www.wnpr.org/rss.xml                           |                            | HTTP 404 Not Found
 https://www.npr.org/rss/rss.php?id=1031                | 2023-01-11 12:31:35.548783 | read timeout
 http://www.npr.org/rss/rss.php?id=2                    | 2023-01-12 01:00:31.967202 | read timeout
 https://feeds.npr.org/510311/podcast.xml               | 2024-01-23 12:22:56.01259  | Working
 https://www.npr.org/rss/podcast.php?id=510310          | 2023-01-12 02:55:47.041486 | read timeout
 https://feeds.npr.org/344098539/podcast.xml            | 2024-02-10 17:53:55.239454 | Working
 https://feeds.npr.org/510310/podcast.xml               | 2024-02-11 08:29:28.383404 | Working
 https://feeds.npr.org/139482413/rss.xml                | 2024-02-11 19:32:40.268002 | Working
 https://www.npr.org/rss/rss.php?id=1070                | 2023-01-11 23:16:58.904563 | read timeout
 https://www.npr.org/rss/rss.php?id=139482413           |                            | read timeout
 http://www.npr.org/rss/rss.php?id=1001                 | 2023-01-11 18:26:11.652014 | read timeout
 https://www.npr.org/rss/podcast.php?id=510298          |                            | read timeout
 https://feeds.npr.org/1014/rss.xml                     | 2024-02-11 19:36:55.697342 | Working
 http://www.npr.org/rss/podcast.php?id=510005           |                            | read timeout
 https://www.npr.org/rss/rss.php?id=1016                | 2023-01-06 01:48:03.534702 | read timeout
 https://www.npr.org/rss/rss.php?id=1146                |                            | read timeout
 http://www.npr.org/templates/rss/podcast.php?id=510051 | 2023-01-12 03:01:52.963369 | read timeout
 https://www.npr.org/rss/rss.php?id=1059                |                            | read timeout
 https://feeds.npr.org/1032/rss.xml                     | 2024-02-11 20:02:31.161386 | Working
 https://feeds.npr.org/35/rss.xml                       | 2024-02-10 20:07:23.470459 | Working
 https://www.npr.org/rss/podcast.php?id=381444908       | 2023-01-12 03:26:42.988977 | read timeout
 http://www.npr.org/rss/rss.php?id=10                   | 2023-01-08 15:45:13.901882 | read timeout
 http://www.npr.org/rss/rss.php?id=7                    | 2023-01-07 15:50:30.009841 | read timeout
 https://www.npr.org/rss/rss.php?id=1138                | 2023-01-11 04:31:58.102369 | read timeout
 https://www.npr.org/rss/podcast.php?id=510313          |                            | read timeout
 https://feeds.npr.org/510316/podcast.xml               | 2024-02-10 20:33:13.789164 | Working
 https://feeds.npr.org/510051/podcast.xml               | 2024-02-10 20:31:06.126535 | Working
 http://www.npr.org/rss/podcast.php?id=510036           | 2023-01-12 05:52:12.621639 | read timeout
 http://www.npr.org/rss/rss.php?id=1007                 | 2023-01-06 05:00:17.738117 | read timeout
 https://feeds.npr.org/510343/podcast.xml               | 2023-10-26 04:52:27.041031 | Working
 https://www.npr.org/rss/rss.php?id=1128                | 2023-01-10 18:36:50.465636 | read timeout
 https://feeds.npr.org/381444908/podcast.xml            | 2024-02-10 20:56:39.917724 | Working
 https://www.npr.org/rss/rss.php?id=1141                |                            | read timeout
 https://www.npr.org/rss/podcast.php?id=510208          | 2023-01-10 19:06:18.346358 | read timeout
 https://feeds.npr.org/93568166/rss.xml                 | 2024-02-10 21:13:42.697239 | Working
 https://feeds.npr.org/510360/podcast.xml               | 2024-02-06 17:26:21.753976 | Working
 https://www.npr.org/rss/podcast.php?id=510051          |                            | read timeout
 https://feeds.npr.org/510318/podcast.xml               | 2024-02-11 19:42:13.081813 | Working
 https://www.npr.org/rss/podcast.php?id=510308          |                            | read timeout
 https://feeds.npr.org/1003/rss.xml                     | 2024-02-11 18:11:06.315008 | Working
 https://www.npr.org/rss/rss.php?id=1125                | 2023-01-11 20:27:10.138619 | read timeout
 https://blog.apps.npr.org/atom.xml                     | 2023-11-29 02:03:35.199899 | Working
 http://www.npr.org/rss/rss.php?id=1002                 | 2023-01-12 09:11:35.59734  | read timeout
 https://www.npr.org/rss/podcast.php?id=510316          | 2023-01-11 21:11:32.118215 | read timeout
 https://www.npr.org/rss/podcast.php?id=510343          |                            | read timeout
 http://www.npr.org/rss/rss.php?id=35                   | 2023-01-07 20:41:03.867103 | read timeout
 http://www.npr.org/rss/rss.php?id=1032                 | 2023-01-11 21:25:12.928003 | read timeout
 http://www.npr.org/rss/rss.php?id=13                   | 2023-01-11 21:31:48.993356 | read timeout
 https://feeds.npr.org/2/rss.xml                        | 2024-02-11 06:18:35.255247 | Working
 https://www.npr.org/rss/rss.php?id=103537970           |                            | read timeout
 http://www.npr.org/rss/rss.php?id=1003                 | 2023-01-11 21:46:58.187557 | read timeout
 https://feeds.npr.org/510298/podcast.xml               | 2024-02-09 18:17:47.431393 | Working
 https://feeds.npr.org/1002/rss.xml                     | 2023-01-12 10:11:01.088269 | read timeout
 https://feeds.npr.org/500005/podcast.xml               |                            | Working
 https://feeds.npr.org/510313/podcast.xml               |                            | Working
 https://feeds.npr.org/13/rss.xml                       | 2024-02-10 19:16:55.249217 | Working
 https://feeds.npr.org/1039/rss.xml                     | 2024-02-11 19:57:32.490519 | Working
 https://www.npr.org/rss/rss.php?id=1015                | 2023-01-10 23:30:32.50637  | read timeout
 https://www.npr.org/rss/podcast.php?id=510016          | 2023-01-10 11:26:16.003465 | read timeout
 https://feeds.npr.org/510016/podcast.xml               | 2024-01-06 20:08:56.872674 | Working
 http://www.npr.org/rss/rss.php?id=1039                 |                            | read timeout
 https://www.npr.org/rss/podcast.php?id=344098539       |                            | read timeout
 https://feeds.npr.org/510355/podcast.xml               | 2024-02-11 06:12:45.238293 | Working
 https://feeds.npr.org/3/rss.xml                        | 2024-02-09 16:42:25.251791 | Working
 https://feeds.npr.org/1015/rss.xml                     | 2024-02-10 21:09:59.747894 | Working
 https://knpr.org/rss.xml                               |                            | HTTP 404 Not Found
 http://www.npr.org/rss/rss.php?id=1006                 | 2023-01-12 12:41:48.817582 | read timeout
 https://feeds.npr.org/1163/rss.xml                     | 2024-02-01 21:06:03.046008 | Working
 https://www.npr.org/rss/rss.php?id=1013                | 2023-01-12 00:56:43.663892 | read timeout
 https://feeds.npr.org/510308/podcast.xml               | 2023-11-07 01:17:46.673472 | Working
 https://feeds.npr.org/914632053/rss.xml                | 2024-01-01 21:44:15.340855 | Working
 http://www.npr.org/rss/podcast.php?id=510127           | 2023-01-05 12:01:15.651475 | read timeout
 https://feeds.npr.org/1165/rss.xml                     | 2024-01-26 21:00:24.864067 | Working
 https://feeds.npr.org/510289/podcast.xml               | 2024-02-10 07:28:28.207731 | Working
 https://feeds.npr.org/510312/podcast.xml               | 2024-02-09 23:37:40.418817 | Working
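To see the pattern in a listing like this at a glance, the rows can be tallied by host and status. A minimal sketch, assuming the three-column psql layout shown above; `status_by_host` is a hypothetical helper and the sample rows are copied from the table.

```python
from collections import Counter
from urllib.parse import urlsplit


def status_by_host(psql_rows):
    """Tally (hostname, system_status) pairs from 'url | last_new_stories | system_status' rows."""
    counts = Counter()
    for row in psql_rows:
        cols = [col.strip() for col in row.split("|")]
        if len(cols) != 3 or not cols[0].startswith("http"):
            continue  # skip the header, the separator line, and blanks
        counts[(urlsplit(cols[0]).hostname, cols[2])] += 1
    return counts


SAMPLE_ROWS = [
    "                          url                           |      last_new_stories      |   system_status    ",
    "--------------------------------------------------------+----------------------------+--------------------",
    " https://www.npr.org/rss/podcast.php?id=510289          |                            | read timeout",
    " http://www.npr.org/rss/rss.php?id=1004                 | 2023-01-12 11:30:01.403775 | read timeout",
    " https://feeds.npr.org/1001/rss.xml                     | 2024-02-10 19:47:20.539661 | Working",
    " https://www.wnpr.org/rss.xml                           |                            | HTTP 404 Not Found",
]

tally = status_by_host(SAMPLE_ROWS)
for (host, status), n in sorted(tally.items()):
    print(host, status, n)
```

Grouping by hostname also makes it visible that the `like '%npr.org/%'` filter picks up unrelated hosts such as www.wnpr.org and knpr.org.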
NullPxl commented 4 months ago

In line with what Phil found, "bot" and "http(s)://[characters]" in the User Agent lead to the timeout for URLs on npr.org (this works fine: mediacloud, for open academic research (+mediacloud.org)). feeds.npr.org does not have this block in place.
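The observed rule can be written down as a small check for screening candidate UA strings before deploying them. This is a heuristic inferred only from the tests in this thread, not NPR's actual filter; case-insensitive matching is an assumption, and `likely_blocked` is a hypothetical name.

```python
import re

# Patterns inferred from the experiments in this thread (an assumption, not a
# documented rule): the substring "bot", or an http(s):// URL, anywhere in the UA.
BLOCK_PATTERNS = (
    re.compile(r"bot", re.IGNORECASE),
    re.compile(r"https?://\S", re.IGNORECASE),
)


def likely_blocked(user_agent):
    """Return True if the UA string matches any pattern seen to trigger the timeout."""
    return any(pattern.search(user_agent) for pattern in BLOCK_PATTERNS)


CANDIDATES = [
    "mediacloud bot for open academic research (mediacloud dot org)",
    "mediacloud not for open academic research (mediacloud dot org)",
    "mediacloud not for open academic research (+https://mediacloud.org)",
    "mediacloud, for open academic research (+mediacloud.org)",
]

for ua in CANDIDATES:
    print(likely_blocked(ua), ua)
```

The four candidates are the strings tried in this thread, and the check agrees with the outcomes reported for each of them.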

rahulbot commented 4 months ago

Thx. We have our answer - they are rejecting our UA. Closing.