mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

CBC Stories Not Showing Up in Search Results #258

Closed NullPxl closed 2 months ago

NullPxl commented 3 months ago

In late February I added multiple feeds to the CBC source in the directory (https://search.mediacloud.org/sources/7333). All feeds are working and enabled, and viewing the feed histories shows that the rss-fetcher is not having any issues.

However, a query that I know should return CBC stories returned none. A wildcard search within the source also finds no content. I believe this is unique to CBC, since I added multiple feeds to the Globe and Mail (https://search.mediacloud.org/sources/19477) on the same day, and content is found for that source.

(screenshot omitted)

NullPxl commented 3 months ago

On inspection it looks like this is related to https://github.com/mediacloud/metadata-lib/issues/83 (NPR stories not being added). Both NPR and CBC use Akamai Bot Manager.

@philbudne would you be able to check logs to see if the response from CBC is the same as the response from NPR?

philbudne commented 3 months ago

Here's what I saw in the production logs:

***@***.***:/srv/data/docker/indexer/worker_data/logs# grep -h 'cbc\.ca/' messages.log.2024-03-0* | head -20                   
2024-03-01 02:30:42,329 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/london/huron-county-leads-swift-rise-in-southwestern-ontario-farmland-values-1.7126396?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/london/huron-county-leads-swift-rise-in-southwestern-ontario-farmland-values-1.7126396?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:30:47,234 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/business/leap-day-working-for-free-1.7127313?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/business/leap-day-working-for-free-1.7127313?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:31:42,334 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/ontario-medcheck-shoppers-drug-mart-pressure-1.7126811?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/ontario-medcheck-shoppers-drug-mart-pressure-1.7126811?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:31:47,930 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/nova-scotia/seniors-call-extensive-landline-outage-safety-concern-1.7126722?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/nova-scotia/seniors-call-extensive-landline-outage-safety-concern-1.7126722?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:32:42,340 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/manitoba/cannabis-spirit-rising-foster-1.7127101?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/manitoba/cannabis-spirit-rising-foster-1.7127101?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:32:48,837 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/saskatoon/woman-68-grateful-for-rescue-after-falling-in-snow-1.7127470?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/saskatoon/woman-68-grateful-for-rescue-after-falling-in-snow-1.7127470?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:33:42,345 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/ottawa/senior-rent-services-fee-increase-ontario-1.7123685?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/ottawa/senior-rent-services-fee-increase-ontario-1.7123685?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:33:48,841 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/newfoundland-labrador/nl-5-wing-goose-bay-german-low-level-training-proposal-1.7126646?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/newfoundland-labrador/nl-5-wing-goose-bay-german-low-level-training-proposal-1.7126646?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:34:42,351 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/saskatoon/roughriders-saskatchewan-ad-sexist-girl-math-news-cbc-1.7127668?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/saskatoon/roughriders-saskatchewan-ad-sexist-girl-math-news-cbc-1.7127668?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:34:48,846 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/edmonton/suspect-in-fatal-u-haul-truck-hit-and-run-crash-arrested-1.7128089?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/edmonton/suspect-in-fatal-u-haul-truck-hit-and-run-crash-arrested-1.7128089?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:35:42,357 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/newfoundland-labrador/nlgames-cross-country-ski-finish-1.7127855?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/newfoundland-labrador/nlgames-cross-country-ski-finish-1.7127855?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:35:48,848 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/calgary/trans-mountain-expansion-cost-estimates-grow-1.7127619?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/calgary/trans-mountain-expansion-cost-estimates-grow-1.7127619?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:36:42,363 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/thenational/the-harsh-reality-of-trying-to-access-ivf-in-canada-1.7127596?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/thenational/the-harsh-reality-of-trying-to-access-ivf-in-canada-1.7127596?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:36:49,768 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/manitoba/nhl-commissioner-gary-bettman-winnipeg-jets-1.7127200?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/manitoba/nhl-commissioner-gary-bettman-winnipeg-jets-1.7127200?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:37:42,369 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/entertainment/run-dmc-trial-conviction-1.7127370?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/entertainment/run-dmc-trial-conviction-1.7127370?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:37:49,772 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/world/free-tuition-for-nyc-med-school-1.7127423?cmp=rss> (failed 1 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/world/free-tuition-for-nyc-med-school-1.7127423?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:38:42,375 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/london/huron-county-leads-swift-rise-in-southwestern-ontario-farmland-values-1.7126396?cmp=rss> (failed 2 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/london/huron-county-leads-swift-rise-in-southwestern-ontario-farmland-values-1.7126396?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:38:49,776 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/business/leap-day-working-for-free-1.7127313?cmp=rss> (failed 2 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/business/leap-day-working-for-free-1.7127313?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:39:42,380 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/ontario-medcheck-shoppers-drug-mart-pressure-1.7126811?cmp=rss> (failed 2 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/ontario-medcheck-shoppers-drug-mart-pressure-1.7126811?cmp=rss took longer than 60.0 seconds..
2024-03-01 02:39:49,781 2651c192a1f6 fetcher DEBUG: Retrying <GET https://www.cbc.ca/news/canada/nova-scotia/seniors-call-extensive-landline-outage-safety-concern-1.7126722?cmp=rss> (failed 2 times): User timeout caused connection failure: Getting https://www.cbc.ca/news/canada/nova-scotia/seniors-call-extensive-landline-outage-safety-concern-1.7126722?cmp=rss took longer than 60.0 seconds..
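
For anyone wanting to summarize retry lines like the ones above, here is a minimal sketch. The regular expression is inferred only from this log excerpt, and `summarize_retries` is a hypothetical helper, not something that exists in story-indexer.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt above (an assumption, not a documented format).
RETRY_RE = re.compile(
    r"fetcher DEBUG: Retrying <GET (?P<url>\S+)> "
    r"\(failed (?P<n>\d+) times\): (?P<reason>.*)"
)


def summarize_retries(lines):
    """Return ({url: highest failure count seen}, Counter of failure kinds)."""
    max_failures = {}
    kinds = Counter()
    for line in lines:
        m = RETRY_RE.search(line)
        if m is None:
            continue  # not a retry line
        url = m.group("url")
        n = int(m.group("n"))
        max_failures[url] = max(max_failures.get(url, 0), n)
        # Keep only the text before the first colon as the failure kind.
        kinds[m.group("reason").split(":", 1)[0]] += 1
    return max_failures, kinds
```

Feeding it the lines of a `messages.log.*` file (for example `summarize_retries(open(path))`) would show which URLs are retried most often and whether every failure is the same timeout.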

Some quick tries with curl and different U-A strings could not reproduce the problem.
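
A rough way to script that kind of check, using only the Python standard library, might look like the sketch below. `probe_user_agents` and its `opener` parameter are hypothetical names for illustration. The actual U-A strings used by rss-fetcher and story-indexer are not shown in this thread and would have to be supplied by the caller.

```python
import urllib.error
import urllib.request


def probe_user_agents(url, user_agents, timeout=60.0,
                      opener=urllib.request.urlopen):
    """Fetch `url` once per User-Agent string.

    Returns {user_agent: HTTP status code, or an "error: ..." string}.
    `opener` is injectable so the function can be exercised offline.
    """
    results = {}
    for ua in user_agents:
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            with opener(req, timeout=timeout) as resp:
                results[ua] = resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status (e.g. 403).
            results[ua] = exc.code
        except OSError as exc:
            # Covers URLError and socket timeouts (no usable response).
            results[ua] = f"error: {exc}"
    return results
```

Comparing the results for the old and new U-A strings against one of the failing CBC URLs would show whether the timeout depends on the header. As noted above, a quick manual test did not reproduce it, so other factors (source IP, request rate) may matter too.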

philbudne commented 3 months ago

NOTE: rss-fetcher has been using the new U-A string since 2024-02-23; story-indexer is not yet using it.

NullPxl commented 3 months ago

Ahhh ok that is probably it.

rahulbot commented 3 months ago

Note: #244 might fix this.

philbudne commented 3 months ago

The new (mcmetadata) UA string was used in last night's batch fetch. I spotted this in the log files:

logs/messages.log.2024-03-07_03:2024-03-07 04:32:30,185 0aa001a9a3ee fetcher DEBUG: Crawled (200) <GET https://www.cbc.ca/player/play/2314122307687> (referer: None)
logs/messages.log.2024-03-07_03:2024-03-07 04:32:30,288 0aa001a9a3ee fetcher INFO: success: https://www.cbc.ca/player/play/2314122307687
logs/messages.log.2024-03-07_03:2024-03-07 04:32:30,293 8b5a293c87bf parser INFO: parsing https://www.cbc.ca/player/play/2314122307687: 192420 characters
logs/messages.log.2024-03-07_03:2024-03-07 04:32:30,442 8b5a293c87bf parser INFO: parsed https://www.cbc.ca/player/play/2314122307687 with trafilatura date 2024-03-04
logs/messages.log.2024-03-07_03:2024-03-07 04:32:30,443 8b5a293c87bf parser INFO: OK-trafilatura: https://www.cbc.ca/player/play/2314122307687
logs/messages.log.2024-03-07_03:2024-03-07 04:32:30,473 47bf6e9ff748 importer INFO: created: https://www.cbc.ca/player/play/2314122307687

The feeds seem to be happy:

rss_fetcher=# select url, last_new_stories, system_status from feeds where sources_id = 7333;
                      url                      |      last_new_stories      | system_status 
-----------------------------------------------+----------------------------+---------------
 https://www.cbc.ca/webfeed/rss/rss-canada     | 2024-03-07 19:01:45.894163 | Working
 https://www.cbc.ca/webfeed/rss/rss-politics   | 2024-03-07 07:36:14.088387 | Working
 https://www.cbc.ca/webfeed/rss/rss-world      | 2024-03-07 10:23:44.2196   | Working
 https://www.cbc.ca/webfeed/rss/rss-topstories | 2024-03-07 15:00:37.984623 | Working
 https://www.cbc.ca/webfeed/rss/rss-health     | 2024-03-07 19:18:01.184231 | Working
 https://www.cbc.ca/webfeed/rss/rss-technology | 2024-03-07 19:19:20.196521 | Working
 https://www.cbc.ca/webfeed/rss/rss-arts       | 2024-03-07 19:19:06.048299 | Working
 https://www.cbc.ca/webfeed/rss/rss-sports     | 2024-03-07 17:21:09.220438 | Working
 https://www.cbc.ca/webfeed/rss/rss-Indigenous | 2024-03-06 20:10:00.287295 | Working
 https://www.cbc.ca/webfeed/rss/rss-business   | 2024-03-06 20:10:31.026778 | Working

rss_fetcher=# select * from fetch_events where feed_id in (select id from feeds where sources_id = 7333) order by created_at desc limit 30;
    id     | feed_id |      event      |             note              |         created_at         
-----------+---------+-----------------+-------------------------------+----------------------------
 307270799 | 2463337 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-07 19:20:40.206583
 307270778 | 2463341 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-07 19:20:07.571864
 307270720 | 2463340 | fetch_succeeded | 0 skipped / 18 dup / 1 added  | 2024-03-07 19:19:20.196521
 307270708 | 2463339 | fetch_succeeded | 0 skipped / 19 dup / 1 added  | 2024-03-07 19:19:06.048299
 307270636 | 2463338 | fetch_succeeded | 0 skipped / 19 dup / 1 added  | 2024-03-07 19:18:01.184231
 307270479 | 2463336 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-07 19:16:15.203579
 307269409 | 2463335 | fetch_succeeded | 0 skipped / 13 dup / 7 added  | 2024-03-07 19:01:45.894163
 307261971 | 2463342 | fetch_succeeded | 0 skipped / 15 dup / 4 added  | 2024-03-07 17:21:09.220438
 307251802 | 2463333 | fetch_succeeded | 0 skipped / 10 dup / 10 added | 2024-03-07 15:00:37.984623
 307232205 | 2463334 | fetch_succeeded | 0 skipped / 18 dup / 2 added  | 2024-03-07 10:23:44.2196
 307226461 | 2463342 | fetch_succeeded | 0 skipped / 16 dup / 4 added  | 2024-03-07 09:01:06.139446
 307220578 | 2463335 | fetch_succeeded | 0 skipped / 19 dup / 1 added  | 2024-03-07 07:41:45.315019
 307220466 | 2463337 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-07 07:40:35.314268
 307220411 | 2463341 | fetch_succeeded | same hash                     | 2024-03-07 07:40:05.347399
 307220350 | 2463340 | fetch_succeeded | same hash                     | 2024-03-07 07:39:16.099611
 307220340 | 2463339 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-07 07:39:05.543653
 307220243 | 2463338 | fetch_succeeded | 0 skipped / 18 dup / 1 added  | 2024-03-07 07:38:00.336182
 307220099 | 2463336 | fetch_succeeded | 0 skipped / 19 dup / 1 added  | 2024-03-07 07:36:14.088387
 307216548 | 2463333 | fetch_succeeded | 0 skipped / 11 dup / 9 added  | 2024-03-07 06:50:33.572889
 307184825 | 2463342 | fetch_succeeded | 0 skipped / 15 dup / 5 added  | 2024-03-07 00:51:05.352953
 307158408 | 2463334 | fetch_succeeded | 0 skipped / 18 dup / 2 added  | 2024-03-06 22:53:43.036737
 307151083 | 2463333 | fetch_succeeded | 0 skipped / 15 dup / 5 added  | 2024-03-06 22:40:32.096999
 307094948 | 2463335 | fetch_succeeded | 0 skipped / 13 dup / 7 added  | 2024-03-06 20:31:40.979472
 307093524 | 2463337 | fetch_succeeded | 0 skipped / 18 dup / 2 added  | 2024-03-06 20:10:31.026778
 307093486 | 2463341 | fetch_succeeded | 0 skipped / 19 dup / 1 added  | 2024-03-06 20:10:00.287295
 307093443 | 2463340 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-06 20:09:15.315718
 307093437 | 2463339 | fetch_succeeded | 0 skipped / 18 dup / 2 added  | 2024-03-06 20:09:01.216995
 307093372 | 2463338 | fetch_succeeded | 0 skipped / 20 dup / 0 added  | 2024-03-06 20:07:56.585521
 307093045 | 2463336 | fetch_succeeded | 0 skipped / 19 dup / 1 added  | 2024-03-06 20:06:12.464016
 307070207 | 2463342 | fetch_succeeded | 0 skipped / 15 dup / 5 added  | 2024-03-06 16:50:55.288055
(30 rows)
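
To turn `fetch_events` notes like those above into a per-feed count of newly added stories, something along these lines could work. This is a sketch: the note format is assumed from the rows shown here, and `tally_added` is a hypothetical helper, not part of rss-fetcher.

```python
import re

# Note format assumed from the query output above, e.g. "0 skipped / 13 dup / 7 added".
NOTE_RE = re.compile(r"(\d+) skipped / (\d+) dup / (\d+) added")


def tally_added(rows):
    """rows: iterable of (feed_id, note). Returns {feed_id: total stories added}."""
    totals = {}
    for feed_id, note in rows:
        totals.setdefault(feed_id, 0)
        m = NOTE_RE.fullmatch(note.strip())
        if m is None:
            continue  # e.g. "same hash": feed unchanged, nothing added
        totals[feed_id] += int(m.group(3))
    return totals
```

A feed whose total stays at zero over a long window, while still reporting `fetch_succeeded`, would be worth a closer look.
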

NullPxl commented 3 months ago

Nice thanks, good sign 👍

rahulbot commented 2 months ago

Closing as resolved by changes.