mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

why is story volume so low compared to prod system? #10

Closed rahulbot closed 4 months ago

rahulbot commented 2 years ago

How come the backup RSS fetcher only pulls in ~250k stories each day, while the prod system finds almost a million? I've updated the backup one so it has all the RSS feeds that have:

That matches 144,269 feeds in the prod database (as of yesterday). Another difference is that this backup one only stores precisely unique URLs.

Are there a bunch of topics running that are spidering things? Is my query filter to find "actually active" feeds overly limited? Does my logic to check them "regularly" have a bug? Thoughts on other potential causes of this big difference?

rahulbot commented 2 years ago

For investigating, feeds are stored in S3 buckets with dates encoded in filenames like this:

rahulbot commented 2 years ago

Average daily volume over the last week or two:

rahulbot commented 2 years ago

Added more feeds in 0396e12. This time it is a dump that checked last_successful_download, not last_new_story.

philbudne commented 2 years ago

reposted from email:

most of the backup RSS file is a single line!

    lines     words     chars
  4175211   5280699 315561091 prod/mc-2022-06-01.rss
       16   1443161  98355897 backup/mc-2022-06-01.rss

Looking at files from 2022-06-01. For each feed file I extracted just the text between and tags into a file named "urls":

  1043783   1043783 119202388 prod/urls
   262124    262124  28563312 backup/urls
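A rough Python equivalent of that extraction step, as a sketch only. The tag name `link` is an assumption (the tag names are missing from the sentence above), and `extract_urls` is a hypothetical helper. A regex over the whole text is used rather than a line-based tool, since most of the backup file is a single line.

```python
import re
from typing import List

def extract_urls(rss_text: str, tag: str = "link") -> List[str]:
    """Return the text between every <tag>...</tag> pair, in document order."""
    # Non-greedy match, so several items on one physical line are split correctly.
    # The default tag name "link" is an assumption, not taken from the thread.
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return [m.strip() for m in pattern.findall(rss_text)]

sample = (
    "<item><link>https://example.com/a</link></item>"
    "<item><link>https://example.com/b</link></item>"
)
print(extract_urls(sample))
```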

Feeding both url files through "sort -u" to get unique URLs shows there are duplicate URLs in both files:

   640483    640483  71922300 prod/urls.uniq
   256867    256867  27982221 backup/urls.uniq
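To put numbers on the duplication, here is the arithmetic on the line counts quoted in this comment. `duplicate_share` is a hypothetical helper name; the figures are copied from the listings for prod/urls, backup/urls and their .uniq counterparts.

```python
def duplicate_share(total: int, unique: int) -> float:
    """Fraction of lines that repeat a URL already present in the same file."""
    return (total - unique) / total

# (total lines, unique lines) per system, copied from the wc output.
counts = {"prod": (1043783, 640483), "backup": (262124, 256867)}

for name, (total, unique) in counts.items():
    dupes = total - unique
    print(f"{name}: {dupes} duplicate lines, {duplicate_share(total, unique):.1%} of the file")
```

This works out to 403300 duplicate lines (about 38.6%) for prod and 5257 (about 2.0%) for backup.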

The "domains" file discarded any http*:// prefix, and the remainder of the URL past the domain:

  1043783   1041765  17769530 prod/domains
   262124    261693   3947539 backup/domains
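The "domains" step can be approximated as below. This is a sketch that assumes a simple regex is close enough to whatever command was actually run; `url_to_domain` is a hypothetical name.

```python
import re

# Drop an http:// or https:// prefix if present, then keep everything
# up to the first "/", "?" or "#".
_DOMAIN_RE = re.compile(r"^(?:https?://)?([^/?#]*)", re.IGNORECASE)

def url_to_domain(url: str) -> str:
    """Return the host part of a URL string, without scheme, path or query."""
    match = _DOMAIN_RE.match(url.strip())
    return match.group(1) if match else ""

print(url_to_domain("https://g1.globo.com/df/distrito-federal/"))
```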

The count of unique domains (sort -u on domains) in each file isn't very different:

    16071     16070    312962 prod/domains.uniq
    16872     16871    358861 backup/domains.uniq

Feeding each domain file through the pipeline sort | uniq -c | sort -rn gives domain frequencies.

Here are the top 20 for each:

$ head -20 prod/domain-counts backup/domain-counts
==> prod/domain-counts <==
  20283 news.google.com
  18695 www.mk.ru
  18251 www.bizjournals.com
   8769 dnrtv.org.vn
   8433 www.shawlocal.com
   6298 www.radionacional.com.ar
   5840 aif.ru
   5487 www.dailytelegraph.com.au
   5119 www.stuff.co.nz
   4972 timesofindia.indiatimes.com
   4687 globalnews.ca
   4431 bnr.bg
   4386 www.bignewsnetwork.com
   4088 www.svt.se
   4083 www.washingtonpost.com
   3925 udn.com
   3881 blog.goo.ne.jp
   3818 www.chinanews.com
   3814 www.iol.co.za
   3611 www.heraldsun.com.au

==> backup/domain-counts <==
   2730 oglecountylife.com
   2481 latestnigeriannews.com
   1699 amarujala.com
   1499 google.com
   1205 uol.com.br
   1126 bhaskar.com
    961 udn.com
    956 timesofindia.indiatimes.com
    920 irna.ir
    838 pantip.com
    682 iribnews.ir
    674 haberler.com
    664 madhyamam.com
    629 globenewswire.com
    625 tw.news.yahoo.com
    595 lanacion.com.ar
    590 bnr.bg
    587 finanznachrichten.de
    583 navbharattimes.indiatimes.com
    580 menafn.com
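The `sort | uniq -c | sort -rn` pipeline has a close analogue in Python's `collections.Counter`. A minimal sketch, with `domain_counts` as a hypothetical helper; the order of ties may differ from what `sort -rn` prints.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def domain_counts(domains: Iterable[str], top: int = 20) -> List[Tuple[str, int]]:
    """Return the `top` most frequent domains as (domain, count), most frequent first."""
    return Counter(domains).most_common(top)

print(domain_counts(["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"], top=2))
```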

Looking at the top six domains in the production feed, NONE of them appear (at all) in backup/domains.
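That check amounts to a set-membership test. A sketch, assuming each domains file has been read into a list of strings; `missing_from` is a hypothetical name, and `backup_sample` is only a handful of entries from the backup listing, not the full file.

```python
from typing import Iterable, List

def missing_from(top_domains: Iterable[str], other_domains: Iterable[str]) -> List[str]:
    """Return the entries of top_domains that never occur in other_domains, keeping order."""
    seen = set(other_domains)
    return [d for d in top_domains if d not in seen]

prod_top6 = [
    "news.google.com", "www.mk.ru", "www.bizjournals.com",
    "dnrtv.org.vn", "www.shawlocal.com", "www.radionacional.com.ar",
]
backup_sample = ["oglecountylife.com", "google.com", "udn.com", "bnr.bg"]
print(missing_from(prod_top6, backup_sample))
```

The comparison is an exact string match, so `news.google.com` and `google.com` count as different domains here.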

An example news.google.com URL: https://news.google.com/__i/rss/rd/articles/CBMidGh0dHBzOi8vZzEuZ2xvYm8uY29tL2RmL2Rpc3RyaXRvLWZlZGVyYWwvbm90aWNpYS8yMDIyLzA2LzAxL25vdmEtY25oLXZlamEtY29tby1zZXJhLW5vdm8tbW9kZWxvLWRhLWhhYmlsaXRhY2FvLmdodG1s0gF_aHR0cHM6Ly9nMS5nbG9iby5jb20vZ29vZ2xlL2FtcC9kZi9kaXN0cml0by1mZWRlcmFsL25vdGljaWEvMjAyMi8wNi8wMS9ub3ZhLWNuaC12ZWphLWNvbW8tc2VyYS1ub3ZvLW1vZGVsby1kYS1oYWJpbGl0YWNhby5naHRtbA?oc=5

Yields a redirect to: https://g1.globo.com/df/distrito-federal/noticia/2022/06/01/nova-cnh-veja-como-sera-novo-modelo-da-habilitacao.ghtml

Another: https://news.google.com/__i/rss/rd/articles/CBMidGh0dHBzOi8vZzEuZ2xvYm8uY29tL2RmL2Rpc3RyaXRvLWZlZGVyYWwvbm90aWNpYS8yMDIyLzA2LzAxL25vdmEtY25oLXZlamEtY29tby1zZXJhLW5vdm8tbW9kZWxvLWRhLWhhYmlsaXRhY2FvLmdodG1s0gF_aHR0cHM6Ly9nMS5nbG9iby5jb20vZ29vZ2xlL2FtcC9kZi9kaXN0cml0by1mZWRlcmFsL25vdGljaWEvMjAyMi8wNi8wMS9ub3ZhLWNuaC12ZWphLWNvbW8tc2VyYS1ub3ZvLW1vZGVsby1kYS1oYWJpbGl0YWNhby5naHRtbA?oc=5

Redirects to: https://g1.globo.com/df/distrito-federal/noticia/2022/06/01/nova-cnh-veja-como-sera-novo-modelo-da-habilitacao.ghtml
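For this particular link, the redirect target can also be read out of the URL itself: the path segment after `/articles/` is URL-safe base64, and the decoded bytes contain the article URL as plain text. This is only an observation about the example in this comment, not a documented Google format, so it may not hold for other links. `embedded_urls` is a hypothetical helper.

```python
import base64
import re
from typing import List

def embedded_urls(article_id: str) -> List[str]:
    """Decode a URL-safe base64 id and return any http(s) URLs found in the bytes."""
    padded = article_id + "=" * (-len(article_id) % 4)  # restore stripped "=" padding
    raw = base64.urlsafe_b64decode(padded)
    # Runs of printable ASCII that start with http:// or https://
    return [m.decode("ascii") for m in re.findall(rb"https?://[\x21-\x7e]+", raw)]

# Path segment after /articles/ (query string removed), copied from the example link.
article_id = "CBMidGh0dHBzOi8vZzEuZ2xvYm8uY29tL2RmL2Rpc3RyaXRvLWZlZGVyYWwvbm90aWNpYS8yMDIyLzA2LzAxL25vdmEtY25oLXZlamEtY29tby1zZXJhLW5vdm8tbW9kZWxvLWRhLWhhYmlsaXRhY2FvLmdodG1s0gF_aHR0cHM6Ly9nMS5nbG9iby5jb20vZ29vZ2xlL2FtcC9kZi9kaXN0cml0by1mZWRlcmFsL25vdGljaWEvMjAyMi8wNi8wMS9ub3ZhLWNuaC12ZWphLWNvbW8tc2VyYS1ub3ZvLW1vZGVsby1kYS1oYWJpbGl0YWNhby5naHRtbA"

for url in embedded_urls(article_id):
    print(url)
```

For the id above, the first URL found is the g1.globo.com article and the second is its AMP variant. Following the HTTP redirect remains the safer way to resolve such links in general.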

philbudne commented 2 years ago

Rahul replied:

Interesting stuff. Nice digging. So it sounds like a few top-level things you’re seeing:

1) our current system pulls in lots more duplicate URLs
2) the list of domains in both are pretty similar
3) there are some domains missing in the backup that are pulling a lot of stories

Is that about right?

Google News RSS links are weird. Perhaps ask Emily and Rebecca what they’ve seen with Google News RSS feeds or URLs and if there is a standard way we deal with them. I wonder what media source that first example was associated with. For instance, we have a Globo source (60427, https://sources.mediacloud.org/#/sources/60427/feeds) with a handful of feeds that look active. Are those feeds in the backup's database?

philbudne commented 2 years ago

Some more queries, looking at globo:


mediacloud=# select distinct media_id from stories where collect_date > '2022-06-22' and url like '%globo.com%';
 media_id 
----------
    40252
   654106
    83352
    97032
    83399
    60347
   101943
    60427
    65275
(9 rows)

mediacloud=# select count(1), media_id from stories where collect_date > '2022-06-22' and url like '%globo.com%' group by media_id;
 count | media_id 
-------+----------
     5 |    40252
    56 |   654106
    75 |    83352
   358 |    97032
   359 |    83399
     5 |    60347
   220 |   101943
  1443 |    60427
     1 |    65275
(9 rows)

60427 (suggested by Rahul) is the largest.

mediacloud=# select * from media where media_id = 60427;
 media_id |         url          |    normalized_url    | name  | full_text_rss | foreign_rss_links | dup_media_id | is_not_dup | content_delay | editor_notes | public_notes | is_monitored 
----------+----------------------+----------------------+-------+---------------+-------------------+--------------+------------+---------------+--------------+--------------+--------------
    60427 | http://g1.globo.com/ | http://g1.globo.com/ | Globo | f             | f                 |              | t          |               |              |              | t
(1 row)

mediacloud=# select * from feeds where media_id = 60427;
 feeds_id | media_id |                   name                    |                                             url                                             |    type    | active |          last_checksum           | last_attempted_download_time  | last_successful_download_time |      last_new_story_time      
----------+----------+-------------------------------------------+---------------------------------------------------------------------------------------------+------------+--------+----------------------------------+-------------------------------+-------------------------------+-------------------------------
   951383 |    60427 | Spider Feed                               | http://g1.globo.com/#spiderfeed                                                             | syndicated | f      |                                  |                               |                               | 
   137181 |    60427 | Globo                                     | http://g1.globo.com/                                                                        | web_page   | f      |                                  | 2019-04-01 00:19:45.549935-04 | 2019-04-01 00:33:11-04        | 2019-04-01 00:19:45.549935-04
   113355 |    60427 | G1                                        | http://g1.globo.com/dynamo/rss2.xml                                                         | syndicated | t      | 87725197ed4caa7bcc1486c0568be589 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
  1606188 |    60427 | G1 – Semana Pop                           | https://audio.globoradio.globo.com/podcast/feed/539/semana-pop                              | syndicated | t      | 921b766d5751972f07f84547bf45e6da | 2022-06-17 20:27:35.337227-04 | 2022-06-17 20:28:15.087585-04 | 2020-10-31 12:43:31.867907-04
  1606191 |    60427 | G1 - Livro Falado                         | https://audio.globoradio.globo.com/podcast/feed/592/g1-livro-falado                         | syndicated | t      | 8995124c813ac0ff30bef87463b54c4a | 2022-06-17 20:27:35.337227-04 | 2022-06-17 20:28:15.087585-04 | 2019-09-09 22:23:09.784729-04
  1606189 |    60427 | G1 – Educação Financeira                  | https://audio.globoradio.globo.com/podcast/feed/531/educacao-financeira                     | syndicated | t      | 2974511423b7ace55c8bd7881aba8583 | 2022-06-20 04:24:18.2106-04   | 2022-06-20 04:24:22.028323-04 | 2022-06-14 20:59:14.929969-04
  1049867 |    60427 | Globo                                     | https://g1.globo.com/                                                                       | web_page   | t      |                                  | 2022-06-17 20:27:35.337227-04 | 2022-06-17 20:28:15.087585-04 | 2022-06-17 20:27:35.337227-04
  1606190 |    60427 | G1 ouviu - seu guia de novidades musicais | https://audio.globoradio.globo.com/podcast/feed/537/g1-ouviu-seu-guia-de-novidades-musicais | syndicated | t      | b8c109da3ec2cd14d47b8a8192e7043b | 2022-06-21 15:59:30.927874-04 | 2022-06-21 15:59:38.795003-04 | 2022-06-19 01:27:31.091694-04
(8 rows)

philbudne commented 2 years ago

Some poking at news.google.com feeds:

mediacloud=# select count(1), media_id from stories where collect_date > '2022-06-22' and url like '%news.google.com%' group by media_id;
 count | media_id 
-------+----------
   653 |   651253
   679 |   651280
   884 |   651262
   460 |   651272
   625 |   348022
   264 |   449219
   382 |   361078
   766 |   651261
   621 |   651270
   729 |   651263
    10 |   295679
   738 |   651257
   751 |   651252
    79 |    59102
   788 |   651277
   663 |   651258
   689 |   651251
   729 |   651273
   177 |   440044
   616 |   651265
   647 |   651269
   688 |   651260
   870 |   651255
   690 |   651279
   779 |   651250
   669 |   651271
   854 |   651268
   943 |   365945
   478 |   375820
   672 |   651278
   603 |   651266
   720 |   649508
   797 |    59984
   799 |   375782
   333 |    27642
   741 |   651259
   788 |   372638
   607 |   651267
   619 |   651254
   770 |   416841
   752 |   651281
   302 |   390374
   565 |   651256
   562 |   651264
   567 |   651274
   202 |    25927
   914 |   651276
  1094 |   651275
(48 rows)

mediacloud=# select * from media where media_id = 651275
mediacloud-# ;
 media_id |                 url                  |         normalized_url         |         name         | full_text_rss | foreign_rss_links | dup_media_id | is_not_dup | content_delay | editor_notes | public_notes | is_monitored 
----------+--------------------------------------+--------------------------------+----------------------+---------------+-------------------+--------------+------------+---------------+--------------+--------------+--------------
   651275 | https://news.google.com/news/?ned=kr | http://google.com/news/?ned=kr | Google - South Korea |               | t                 |              |            |             0 |              |              | t
(1 row)

mediacloud=# select * from media where url like 'https://news.google.com/news%';
 media_id |                        url                        |               normalized_url                |                  name                  | full_text_rss | for
eign_rss_links | dup_media_id | is_not_dup | content_delay | editor_notes | public_notes | is_monitored 
----------+---------------------------------------------------+---------------------------------------------+----------------------------------------+---------------+----
---------------+--------------+------------+---------------+--------------+--------------+--------------
   651251 | https://news.google.com/news/?ned=bg_bg           | http://google.com/news/?ned=bg_bg           | Google - Bulgaria                      |               | t  
               |              |            |             0 |              |              | f
   651258 | https://news.google.com/news/?ned=hi_in           | http://google.com/news/?ned=hi_in           | Google (Hindi)                         |               | t  
               |              |            |             0 |              |              | f
   651259 | https://news.google.com/news/?ned=ml_in           | http://google.com/news/?ned=ml_in           | Google (Malayalam)                     |               | t  
               |              |            |             0 |              |              | f
   651260 | https://news.google.com/news/?ned=ta_in           | http://google.com/news/?ned=ta_in           | Google (Tamil)                         |               | t  
               |              |            |             0 |              |              | f
   651261 | https://news.google.com/news/?ned=te_in           | http://google.com/news/?ned=te_in           | Google (Telugu)                        |               | t  
               |              |            |             0 |              |              | f
   449219 | https://news.google.com/news/rss/?ned=tw&hl=zh-TW | http://google.com/news/rss/?ned=tw&hl=zh-tw | Google News Taiwan                     |               | t  
               |              |            |             0 |              |              | t
   651278 | https://news.google.com/news/?ned=ru_ua           | http://google.com/news/?ned=ru_ua           | Google (Russian)                       |               | t  
               |              |            |             0 |              |              | f
   651281 | https://news.google.com/news/?ned=vi_vn           | http://google.com/news/?ned=vi_vn           | Google - Vietnam                       |               | t  
               |              |            |             0 |              |              | t
   651279 | https://news.google.com/news/?ned=uk_ua           | http://google.com/news/?ned=uk_ua           | Google (Ukrainian)                     |               | t  
               |              |            |             0 |              |              | f
   651272 | https://news.google.com/news/?ned=ar_sa           | http://google.com/news/?ned=ar_sa           | Google - Saudi Arabia                  |               | t  
               |              |            |             0 |              |              | t
   651274 | https://news.google.com/news/?ned=sk_sk           | http://google.com/news/?ned=sk_sk           | Google - Slovakia                      |               | t  
               |              |            |             0 |              |              | t
   651263 | https://news.google.com/news/?ned=iw_il           | http://google.com/news/?ned=iw_il           | Google (Hebrew)                        |               | t  
               |              |            |             0 |              |              | t
   416841 | https://news.google.com/news?ned=cn               | http://google.com/news?ned=cn               | Google News China                      |               | t  
               |              |            |             0 |              |              | t
   651250 | https://news.google.com/news/?ned=pt-BR_br        | http://google.com/news/?ned=pt-br_br        | Google - Brazil                        |               | t  
               |              |            |             0 |              |              | t
   651256 | https://news.google.com/news/?ned=hk              | http://google.com/news/?ned=hk              | Google - Hong Kong                     |               | t  
               |              |            |             0 |              |              | t
   649508 | https://news.google.com/news/?ned=bn_bd           | http://google.com/news/?ned=bn_bd           | Google - Bangladesh                    |               | t  
               |              |            |             0 |              |              | t
   651254 | https://news.google.com/news/?ned=ar_eg           | http://google.com/news/?ned=ar_eg           | Google - Egypt                         |               | t  
               |              |            |             0 |              |              | t
   651255 | https://news.google.com/news/?ned=el_gr           | http://google.com/news/?ned=el_gr           | Google - Greece                        |               | t  
               |              |            |             0 |              |              | t
   651253 | https://news.google.com/news/?ned=cs_cz           | http://google.com/news/?ned=cs_cz           | Google - Czech Republic                |               | t  
               |              |            |             0 |              |              | t
   651269 | https://news.google.com/news/?ned=pt-PT_pt        | http://google.com/news/?ned=pt-pt_pt        | Google - Portugal                      |               | t  
               |              |            |             0 |              |              | t
   651275 | https://news.google.com/news/?ned=kr              | http://google.com/news/?ned=kr              | Google - South Korea                   |               | t  
               |              |            |             0 |              |              | t
   651268 | https://news.google.com/news/?ned=pl_pl           | http://google.com/news/?ned=pl_pl           | Google - Poland                        |               | t  
               |              |            |             0 |              |              | t
   651252 | https://news.google.com/news/?ned=cn              | http://google.com/news/?ned=cn              | Google - China (National)              |               | t  
               |              |            |             0 |              |              | t
   651265 | https://news.google.com/news/?ned=ar_lb           | http://google.com/news/?ned=ar_lb           | Google - Lebanon                       |               | t  
               |              |            |             0 |              |              | t
   651276 | https://news.google.com/news/?ned=tw              | http://google.com/news/?ned=tw              | Google - Taiwan                        |               | t  
               |              |            |             0 |              |              | t
   651271 | https://news.google.com/news/?ned=ru_ru           | http://google.com/news/?ned=ru_ru           | Google - Russia                        |               | t  
               |              |            |             0 |              |              | f
   651270 | https://news.google.com/news/?ned=ro_ro           | http://google.com/news/?ned=ro_ro           | Google - Romania                       |               | t  
               |              |            |             0 |              |              | f
   651257 | https://news.google.com/news/?ned=hu_hu           | http://google.com/news/?ned=hu_hu           | Google - Hungary                       |               | t  
               |              |            |             0 |              |              | f
   651267 | https://news.google.com/news/?ned=ar_me           | http://google.com/news/?ned=ar_me           | Google - Near and Middle East Regional |               | t  
               |              |            |             0 |              |              | f
   651273 | https://news.google.com/news/?ned=sr_rs           | http://google.com/news/?ned=sr_rs           | Google - Serbia                        |               | t  
               |              |            |             0 |              |              | f
   651277 | https://news.google.com/news/?ned=tr_tr           | http://google.com/news/?ned=tr_tr           | Google - Turkey                        |               | t  
               |              |            |             0 |              |              | f
   651264 | https://news.google.com/news/?ned=lv_lv           | http://google.com/news/?ned=lv_lv           | Google - Latvia                        |               | t  
               |              |            |             0 |              |              | f
   651266 | https://news.google.com/news/?ned=lt_lt           | http://google.com/news/?ned=lt_lt           | Google - Lithuania                     |               | t  
               |              |            |             0 |              |              | t
   651280 | https://news.google.com/news/?ned=ar_ae           | http://google.com/news/?ned=ar_ae           | Google - United Arab Emirates          |               | t  
               |              |            |             0 |              |              | t
   651262 | https://news.google.com/news/?ned=id_id           | http://google.com/news/?ned=id_id           | Google - Indonesia                     |               | t  
               |              |            |             0 |              |              | t
(35 rows)

mediacloud=# select * from feeds where url like 'https://news.google.com/news%';
 feeds_id | media_id |                             name                              |                                   url                                   |    type  
  | active |          last_checksum           | last_attempted_download_time  | last_successful_download_time |      last_new_story_time      
----------+----------+---------------------------------------------------------------+-------------------------------------------------------------------------+----------
--+--------+----------------------------------+-------------------------------+-------------------------------+-------------------------------
   858131 |   375820 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | f      | 8751f64868a480dc9d5aa18915cccdde | 2018-03-28 12:09:40.16676-04  | 2018-03-28 13:08:17-04        | 2018-03-28 12:09:40.16676-04
   635665 |   295679 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | f      | 128d5ff62cd3c67dd015488b1e58697d | 2018-03-28 12:09:40.16676-04  | 2018-03-28 13:15:43-04        | 2018-03-28 12:09:40.16676-04
   858151 |   651265 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   619144 |   449219 | Google News Taiwan                                            | https://news.google.com/news/rss/?ned=tw&hl=zh-TW/data/rss              | syndicate
d | t      | 23cc660e1753546e174271379fdcfabe | 2022-06-18 21:27:26.04296-04  | 2022-06-18 21:27:30.30171-04  | 2022-06-08 07:57:54.41743-04
   858150 |   651263 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858146 |   651260 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858161 |   651272 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   585815 |   416841 | Top Stories - Google News                                     | https://news.google.com/news/rss/?ned=cn&hl=zh-CN                       | syndicate
d | t      | 9d7cf6e5b5fa3f061dcef5b627a01b7f | 2022-06-21 07:29:21.384175-04 | 2022-06-18 19:57:29.996409-04 | 2022-06-18 16:27:19.877431-04
   858145 |   651259 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858168 |   651280 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
mediacloud=# select * from feeds where url like 'https://news.google.com/news%' and active = 't' and last_successful_download_time > '2022-06-01'
mediacloud-# ;
 feeds_id | media_id |                             name                              |                                   url                                   |    type  
  | active |          last_checksum           | last_attempted_download_time  | last_successful_download_time |      last_new_story_time      
----------+----------+---------------------------------------------------------------+-------------------------------------------------------------------------+----------
--+--------+----------------------------------+-------------------------------+-------------------------------+-------------------------------
   609336 |   440044 | Top Stories - Google News                                     | https://news.google.com/news/rss/?ned=hk&hl=zh-HK                       | syndicate
d | t      | 068bf9cc49c24f49820e57e5b68323d2 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 11:59:55.7679-04   | 2022-06-23 10:29:44.990395-04
   619144 |   449219 | Google News Taiwan                                            | https://news.google.com/news/rss/?ned=tw&hl=zh-TW/data/rss              | syndicate
d | t      | 23cc660e1753546e174271379fdcfabe | 2022-06-18 21:27:26.04296-04  | 2022-06-18 21:27:30.30171-04  | 2022-06-08 07:57:54.41743-04
  1096396 |   449219 | More Top Stories - Google News                                | https://news.google.com/news/rss/?ned=tw&hl=zh-TW&gl=TW/feeds           | syndicate
d | t      | 543fd42adfbc6d7bc67f4e769a9d7b90 | 2022-06-22 16:58:28.989124-04 | 2022-06-22 16:58:37.524483-04 | 2022-06-20 01:24:13.490346-04
   982298 |   375820 | https://news.google.com/news/rss/headlines?ned=fr&gl=FR&hl=fr | https://news.google.com/news/rss/headlines?ned=fr&gl=FR&hl=fr           | syndicate
d | t      | 876a2d7e46a5a8c5a0006bb19e69537a | 2022-06-18 12:57:14.867751-04 | 2022-06-18 12:57:21.183534-04 | 2022-06-10 20:35:41.291772-04
   585815 |   416841 | Top Stories - Google News                                     | https://news.google.com/news/rss/?ned=cn&hl=zh-CN                       | syndicate
d | t      | 9d7cf6e5b5fa3f061dcef5b627a01b7f | 2022-06-21 07:29:21.384175-04 | 2022-06-18 19:57:29.996409-04 | 2022-06-18 16:27:19.877431-04
   982299 |   295679 | https://news.google.com/news/rss/?ned=en_ng&gl=NG&hl=en       | https://news.google.com/news/rss/?ned=en_ng&gl=NG&hl=en                 | syndicate
d | t      | 6d44b644a5e627c8159808b32c3e0718 | 2022-06-22 00:59:43.200625-04 | 2022-06-19 09:24:02.672628-04 | 2022-06-19 09:23:56.217234-04
   844771 |   649508 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858151 |   651265 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858152 |   651264 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858149 |   375782 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858145 |   651259 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858168 |   651280 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858146 |   651260 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858166 |   651277 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858161 |   651272 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858150 |   651263 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858144 |   651257 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858159 |   651270 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   811090 |   348022 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858158 |   651269 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858155 |   651262 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 2c7dfdfd798c2171e8f9eb3ee028e371 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858134 |   651250 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | b2be40894860c04223e2a27fc3293eca | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
  1647533 |    55491 | "hemp" - Google News                                          | https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=hemp&output=rss | syndicate
d | t      | 32924a978b84538a3b5e9ab9ab5b111b | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858147 |   651261 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | b2be40894860c04223e2a27fc3293eca | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858135 |   651251 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | b2be40894860c04223e2a27fc3293eca | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858160 |   651271 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 2c7dfdfd798c2171e8f9eb3ee028e371 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858137 |   651252 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | b3ffebc77abac07dbab61ec47ba200d7 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858169 |   651279 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | f9e54f98dfaa4333b09bfb864d9423be | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   695317 |    59984 | Top Stories - Google News                                     | https://news.google.com/news/rss/                                       | syndicate
d | t      | 2c7dfdfd798c2171e8f9eb3ee028e371 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858156 |   365945 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858140 |   651254 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 160f4e1da8a34dd8133eaed103b50601 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858154 |   651267 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 4cc7dcbc3cbfa7bcf2082581c63911a8 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858167 |   651278 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858142 |   651256 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858138 |   416841 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858162 |   651273 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 160f4e1da8a34dd8133eaed103b50601 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
   858164 |   651275 | More Top Stories - Google News                                | https://news.google.com/news/rss/                                       | syndicate
d | t      | 160f4e1da8a34dd8133eaed103b50601 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
(47 rows)

mediacloud=# select distinct media_id from feeds where url like 'https://news.google.com/news%' and active = 't' and last_successful_download_time > '2022-06-01'
;
 media_id 
----------
    55491
    59984
   295679
   348022
   361078
   365945
   372638
   375782
   375820
   416841
   440044
   449219
   649508
   651250
   651251
   651252
   651253
   651254
   651255
   651256
   651257
   651258
   651259
   651260
   651261
   651262
   651263
   651264
   651265
   651266
   651267
   651268
   651269
   651270
   651271
   651272
   651273
   651274
   651275
   651276
   651277
   651278
   651279
   651280
   651281
(45 rows)
philbudne commented 2 years ago

Whoops! The above two news.google.com domains were the same one! Here's another:

https://news.google.com/__i/rss/rd/articles/CBMi-AFodHRwczovL21pZGRsZS1lYXN0LW9ubGluZS5jb20vJUQ4JUE3JUQ5JTg0JUQ4JUFDJUQ4JUFGJUQ5JTg0LSVEOSU4QSVEOCVCMSVEOCVBNyVEOSU4MSVEOSU4Mi0lRDklODglRDglQjUlRDklODglRDklODQtJUQ4JUIxJUQ4JUE4JUQ4JUE3JUQ4JUFBLSVEOCVBOCVEOSU4QSVEOSU4OCVEOCVBQS0lRDglQUQlRDklODIlRDklOEElRDklODIlRDklOEElRDglQTclRDglQUEtJUQ4JUE3JUQ5JTg0JUQ5JTg5LSVEOCVBRiVEOCVBOCVEOSU4QdIBAA?oc=5

Turns into: https://middle-east-online.com/%D8%A7%D9%84%D8%AC%D8%AF%D9%84-%D9%8A%D8%B1%D8%A7%D9%81%D9%82-%D9%88%D8%B5%D9%88%D9%84-%D8%B1%D8%A8%D8%A7%D8%AA-%D8%A8%D9%8A%D9%88%D8%AA-%D8%AD%D9%82%D9%8A%D9%82%D9%8A%D8%A7%D8%AA-%D8%A7%D9%84%D9%89-%D8%AF%D8%A8%D9%8A
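
A rough sketch of how such a redirect link might be unwrapped offline. It assumes (not confirmed anywhere in this thread) that the last path segment is URL-safe base64 with the padding stripped, wrapping the publisher URL between a few non-text bytes. The function name `decode_google_news_url` is made up for illustration. Since base64 is case-sensitive, this could only work on the original link, not on a lowercased copy of it.

```python
import base64
import re
from urllib.parse import urlsplit

# Assumption: the publisher URL sits inside the decoded bytes as plain ASCII,
# bounded by non-printable framing bytes on either side.
URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def decode_google_news_url(link):
    """Return the first URL embedded in the article id, or None if none is found."""
    article_id = urlsplit(link).path.rsplit("/", 1)[-1]
    # Restore the '=' padding that the link omits.
    padded = article_id + "=" * (-len(article_id) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:
        return None
    match = URL_RE.search(raw)
    return match.group(0).decode("ascii") if match else None
```

If the assumption holds, ids that carry a second (AMP) URL would still yield the first one, because the search stops at the non-printable byte that separates them.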

rahulbot commented 2 years ago

This feels like a well-defined subproblem - how to treat aggregators like Google News. I've relayed that question to the research team with some ideas about paths forward. Feel free to chime in on that thread.

philbudne commented 2 years ago

Things done:

DB queries on running MC system (for articles collected on June 1) to determine exact feeds responsible for URLs not present in backup feed:

pbudne@postgresql:~/rss$ cat art-downloads-by-feed4.psql
with t1 as(select count(1), d.feeds_id
    from downloads_success d
    where d.type = 'content' and
        d.download_time >= '2022-06-01' and
        d.download_time < '2022-06-02'
    group by d.feeds_id
    order by count desc
)
select SUM(t1.count), f.url     -- XXX want to remove http(s)://
from t1, feeds f
where t1.feeds_id = f.feeds_id
group by f.url
order by sum desc;
pbudne@postgresql:~/rss$ head art-downloads-by-feed4.csv 
sum,url
25497,http://www.mk.ru/rss/index.xml
19848,http://www.mk.ru/rss/news/index.xml
17870,https://www.mk.ru/rss/news/index.xml
10207,https://news.google.com/news/rss/
8433,https://www.shawlocal.com/arcio/rss/
6254,http://www.radionacional.com.ar/feed/
5953,http://rss.home.uol.com.br/index.xml
3689,http://www.aif.ru/rss/all.php
3554,https://www.svt.se/rss.xml
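
The `XXX want to remove http(s)://` note in the query could also be handled after the fact on the CSV. A minimal sketch, assuming the `sum,url` layout shown above; the helper name `merge_by_schemeless_url` is hypothetical and not part of any existing script:

```python
import csv
import io
import re
from collections import Counter

# Matches a leading http:// or https:// prefix only.
SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)


def merge_by_schemeless_url(csv_text):
    """Sum download counts per feed URL, ignoring the http/https prefix."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[SCHEME_RE.sub("", row["url"])] += int(row["sum"])
    # Largest totals first, like the "order by sum desc" in the query.
    return totals.most_common()
```

With the rows above, the `http` and `https` variants of `www.mk.ru/rss/news/index.xml` would collapse into one entry of 19848 + 17870 = 37718 downloads.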
philbudne commented 2 years ago

And from there, investigate why articles from those feed URLs (wanted or not) are not being picked up by the backup-rss-fetcher.

Dump of data from a basic fetch & parse using requests and feedparser, plus the data transform code from backup-rss-fetcher (no errors):

https://news.google.com/news/rss/
====== d.X:
 bozo False
 entries ...
 feed ...
 headers {}
 encoding utf-8
 version rss20
 namespaces {'media': 'http://search.yahoo.com/mrss/'}
==== d.feed.X:
 generator_detail FeedParserDict {'name': 'NFE/5.0'}
 generator str 'NFE/5.0'
 title str 'Top stories - Google News'
 title_detail FeedParserDict {'type': 'text/plain', 'language': None, 'base': '...
 links list [{'rel': 'alternate', 'type': 'text/html', 'href':...
 link str 'https://news.google.com/?hl=en-US&gl=US&ceid=US:e...
 language str 'en-US'
 publisher str 'news-webmaster@google.com'
 publisher_detail FeedParserDict {'email': 'news-webmaster@google.com'}
 rights str '2022 Google Inc.'
 rights_detail FeedParserDict {'type': 'text/plain', 'language': None, 'base': '...
 updated str 'Thu, 30 Jun 2022 19:05:03 GMT'
 updated_parsed struct_time time.struct_time(tm_year=2022, tm_mon=6, tm_mday=3...
 subtitle str 'Google News'
 subtitle_detail FeedParserDict {'type': 'text/html', 'language': None, 'base': ''...
---
{'url': 'http://google.com/__i/rss/rd/articles/cbmixmh0dhbzoi8vd3d3lndhc2hpbmd0b25wb3n0lmnvbs9jbgltyxrllwvudmlyb25tzw50lziwmjivmdyvmzavzxbhlxn1chjlbwuty291cnqtd2vzdc12axjnaw5pys_saqa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAYsDObicBMWgqrf6J5D06AqGAgEKg8IACoHCAowjtSUCjC30XQw0fe8Bg', 'published_at': datetime.datetime(2022, 6, 30, 18, 52, 39, tzinfo=tzutc()), 'title': 'Supreme Court ruling West Virginia v. EPA chills Biden climate agenda - The Washington Post', 'normalized_title': 'supreme court ruling west virginia v. epa chills biden climate agenda the washington post', 'normalized_title_hash': '969d315ba7bc6ec1f9c961df0b0c7c89'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixmh0dhbzoi8vd3d3lm5wci5vcmcvmjaymi8wni8zmc8xmta4nze0mzq1l2tldgfuamktynjvd24tamfja3nvbi1zdxbyzw1llwnvdxj0lw9hdggtc3dlyxjpbmctaw7saqa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEKE18YUh-MQh6WoHMBn3KB0qFwgEKg4IACoGCAow9vBNMK3UCDClvJYH', 'published_at': datetime.datetime(2022, 6, 30, 18, 3, tzinfo=tzutc()), 'title': 'Ketanji Brown Jackson sworn in as first Black woman on the Supreme Court - NPR', 'normalized_title': 'ketanji brown jackson sworn in as first black woman on the supreme court npr', 'normalized_title_hash': '0846a04dd8a96c598fbbf38623ee7b75'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmidwh0dhbzoi8vd3d3lndzai5jb20vyxj0awnszxmvymlkzw4tc2f5cy1ozs1zdxbwb3j0cy1legnlchrpb24tdg8tzmlsawj1c3rlci10by1jb2rpznktcm9llxytd2fkzs1pbnrvlwxhdy0xmty1nju5njeym9ibewh0dhbzoi8vd3d3lndzai5jb20vyw1wl2fydgljbgvzl2jpzgvulxnhexmtagutc3vwcg9ydhmtzxhjzxb0aw9ulxrvlwzpbglidxn0zxitdg8ty29kawz5lxjvzs12lxdhzgutaw50by1syxctmte2nty1otyxmjm?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAIiu6fmy5UgJmVDVz6HJU0qGAgEKg8IACoHCAow1tzJATDnyxUwxsrPBg', 'published_at': datetime.datetime(2022, 6, 30, 18, 32, tzinfo=tzutc()), 'title': 'Biden Says He Supports Exception to Filibuster to Codify Roe v. Wade Into Law - The Wall Street Journal', 'normalized_title': 'biden says he supports exception to filibuster to codify roe v. wade into law the wall street journal', 'normalized_title_hash': '23b3643177b28bcc32f64adc1a2942ff'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiagh0dhbzoi8vywjjbmv3cy5nby5jb20vvvmvd29tyw4td2fudgvklw11cmrlci1wcm9mzxnzaw9uywwty3ljbglzdc1hcnjlc3rlzc1jb3n0ys1yawnhl3n0b3j5p2lkptg2mda4mjy40gfsahr0chm6ly9hymnuzxdzlmdvlmnvbs9hbxavvvmvd29tyw4td2fudgvklw11cmrlci1wcm9mzxnzaw9uywwty3ljbglzdc1hcnjlc3rlzc1jb3n0ys1yawnhl3n0b3j5p2lkptg2mda4mjy4?oc=5', 'domain': 'google.com', 'guid': 'CBMiaGh0dHBzOi8vYWJjbmV3cy5nby5jb20vVVMvd29tYW4td2FudGVkLW11cmRlci1wcm9mZXNzaW9uYWwtY3ljbGlzdC1hcnJlc3RlZC1jb3N0YS1yaWNhL3N0b3J5P2lkPTg2MDA4MjY40gFsaHR0cHM6Ly9hYmNuZXdzLmdvLmNvbS9hbXAvVVMvd29tYW4td2FudGVkLW11cmRlci1wcm9mZXNzaW9uYWwtY3ljbGlzdC1hcnJlc3RlZC1jb3N0YS1yaWNhL3N0b3J5P2lkPTg2MDA4MjY4', 'published_at': datetime.datetime(2022, 6, 30, 16, 29, 42, tzinfo=tzutc()), 'title': 'Woman wanted in murder of professional cyclist arrested in Costa Rica - ABC News', 'normalized_title': 'woman wanted in murder of professional cyclist arrested in costa rica abc news', 'normalized_title_hash': '0a7d2eb6a26b4f65aa2d1ad1cda1a66c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmidmh0dhbzoi8vd3d3lm5iy25ld3muy29tl3bvbgl0awnzl3n1chjlbwuty291cnqvc3vwcmvtzs1jb3vydc1hbgxvd3mtymlkzw4tzw5klxrydw1wlwvyys1yzw1haw4tbwv4awnvlxbvbgljes1yy25hmzixodfsaspodhrwczovl3d3dy5uymnuzxdzlmnvbs9uzxdzl2ftcc9yy25hmzixodc?oc=5', 'domain': 'google.com', 'guid': 'CAIiEMigMpbAABACG2LHUl_tiD4qGQgEKhAIACoHCAowvIaCCzDnxf4CMP2F8gU', 'published_at': datetime.datetime(2022, 6, 30, 16, 53, 57, tzinfo=tzutc()), 'title': "Supreme Court allows Biden to end Trump-era 'Remain in Mexico' policy - NBC News", 'normalized_title': "supreme court allows biden to end trump-era 'remain in mexico' policy nbc news", 'normalized_title_hash': 'c51e92764483bc1491c0e92a37b50aad'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmir2h0dhbzoi8vd3d3lm55dgltzxmuy29tlziwmjivmdyvmzavdxmvzmxvcmlkys1hym9ydglvbi1iyw4tymxvy2tlzc5odg1s0gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiEJPSW3UijMHKE0QNBtfICmsqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY', 'published_at': datetime.datetime(2022, 6, 30, 18, 45, 21, tzinfo=tzutc()), 'title': 'Florida Judge Will Temporarily Block 15-Week Abortion Ban - The New York Times', 'normalized_title': 'florida judge will temporarily block 15-week abortion ban the new york times', 'normalized_title_hash': '01b7b88fadbe192c424c7cb5d4bb8288'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiswh0dhbzoi8vd3d3lnd2dg0xmy5jb20vyxj0awnszs9kzxb1dgllcy1zag90lwfsywjhbwetymliyi1jb3vudhkvnda0njq1mtbsau1odhrwczovl3d3dy53dnrtmtmuy29tl2ftcc9hcnrpy2xll2rlchv0awvzlxnob3qtywxhymftys1iawjilwnvdw50es80mdq2nduxng?oc=5', 'domain': 'google.com', 'guid': 'CAIiEGGlifvLBkoTVrnXyDK1tQQqMwgEKioIACIQwI8Wot4P9IDiDxcV2kUGOCoUCAoiEMCPFqLeD_SA4g8XFdpFBjgw84yyBw', 'published_at': datetime.datetime(2022, 6, 30, 13, 14, tzinfo=tzutc()), 'title': "2 Bibb County deputies shot, manhunt underway for 'armed and dangerous' suspect - WVTM13 Birmingham", 'normalized_title': "2 bibb county deputies shot, manhunt underway for 'armed and dangerous' suspect wvtm13 birmingham", 'normalized_title_hash': '33f2554f6701dbb5a3b965ef2d80dd40'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmifgh0dhbzoi8vd3d3lmluzgvwzw5kzw50lmnvlnvrl25ld3mvd29ybgqvyw1lcmljyxmvdxmtcg9saxrpy3mvdhj1bxatdg9kyxktd2l0bmvzcy1qyw4tni1jb21taxr0zwutagvhcmluzy1uzxdzlwiymteynje0lmh0bwzsayabahr0chm6ly93d3cuaw5kzxblbmrlbnquy28udwsvbmv3cy93b3jszc9hbwvyawnhcy91cy1wb2xpdgljcy90cnvtcc10b2rhes13axruzxnzlwphbi02lwnvbw1pdhrlzs1ozwfyaw5nlw5ld3mtyjixmti2mtquahrtbd9hbxa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEE0iMMarb_LnGyytLMjBl5sqFggEKg4IACoGCAowzdp7ML-3CTCtyxU', 'published_at': datetime.datetime(2022, 6, 30, 14, 50, 18, tzinfo=tzutc()), 'title': 'Jan 6 hearings – live: Liz Cheney warns of Trump’s ‘domestic threat’ as Melania texts revealed - The Independent', 'normalized_title': 'jan 6 hearings – live liz cheney warns of trump’s ‘domestic threat’ as melania texts revealed the independent', 'normalized_title_hash': 'bcca3d0b0a35bbe57eef92de842803b3'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmilafodhrwczovl3d3dy5uewrhawx5bmv3cy5jb20vbmv3lxlvcmsvbnljlwnyaw1ll255lxdvbwfulwzhdgfsbhktc2hvdc1wdxnoaw5nlxn0cm9sbgvylxvwcgvylwvhc3qtc2lkzs0ymdiymdyzmc1oatzyadnmcxpuywp6cgq1z252ajdochnrys1zdg9yes5odg1s0gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiEJ0NbVI5ObSEaxHO1r_tJ1EqGQgEKhAIACoHCAow1feUCzCqy6oDMPPizQY', 'published_at': datetime.datetime(2022, 6, 30, 14, 37, tzinfo=tzutc()), 'title': "Baby's dad sought for questioning after young mom fatally shot in head pushing stroller on Upper East Side: NYPD sources - New York Daily News", 'normalized_title': "baby's dad sought for questioning after young mom fatally shot in head pushing stroller on upper east side nypd sources new york daily news", 'normalized_title_hash': 'a1f1995d20daff2bdb21bbd89fefc94e'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmizmh0dhbzoi8vd3d3lndhc2hpbmd0b25wb3n0lmnvbs9wb2xpdgljcy8ymdiylza2lzmwl3n1chjlbwuty291cnqtzmvkzxjhbc1lbgvjdglvbnmtc3rhdgutbgvnaxnsyxr1cmvzl9ibaa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEPBIITDMg5_LkY8on7igw1EqGAgEKg8IACoHCAowjtSUCjC30XQwzqe5AQ', 'published_at': datetime.datetime(2022, 6, 30, 17, 48, tzinfo=tzutc()), 'title': "Supreme Court to review state legislatures' power in federal elections - The Washington Post", 'normalized_title': "supreme court to review state legislatures' power in federal elections the washington post", 'normalized_title_hash': '72c644c4d0646db7399591e3bb70656b'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmia2h0dhbzoi8vd3d3lnjldxrlcnmuy29tl3dvcmxkl2v1cm9wzs9ydxnzawetc3rlchmtdxatyxr0ywnrcy11a3jhaw5llwfmdgvylwxhbmrtyxjrlw5hdg8tc3vtbwl0ltiwmjitmdytmzav0gea?oc=5', 'domain': 'google.com', 'guid': 'CBMia2h0dHBzOi8vd3d3LnJldXRlcnMuY29tL3dvcmxkL2V1cm9wZS9ydXNzaWEtc3RlcHMtdXAtYXR0YWNrcy11a3JhaW5lLWFmdGVyLWxhbmRtYXJrLW5hdG8tc3VtbWl0LTIwMjItMDYtMzAv0gEA', 'published_at': datetime.datetime(2022, 6, 30, 17, 49, tzinfo=tzutc()), 'title': 'Russia abandons Snake Island in victory for Ukraine - Reuters.com', 'normalized_title': 'russia abandons snake island in victory for ukraine reuters.com', 'normalized_title_hash': 'ba3b447dab959859e7b122e377fb0237'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmibmh0dhbzoi8vd3d3lmjsb29tymvyzy5jb20vbmv3cy9hcnrpy2xlcy8ymdiylta2ltmwl3vzlxdpbgwtzmfjzs1oawdolwdhcy1wcmljzxmtyxmtbg9uzy1hcy1pdc10ywtlcy1iawrlbi1zyxlz0gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiEOLelQGrCSKPyf59VRvp8i8qGQgEKhAIACoHCAow4uzwCjCF3bsCMIrOrwM', 'published_at': datetime.datetime(2022, 6, 30, 13, 30, 30, tzinfo=tzutc()), 'title': "US Will Face High Gas Prices 'as Long as It Takes,' Biden Says - Bloomberg", 'normalized_title': "us will face high gas prices 'as long as it takes,' biden says bloomberg", 'normalized_title_hash': '108c756d23b0ae22f8f7317b78a48efd'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmix2h0dhbzoi8vd3d3lmnubi5jb20vmjaymi8wni8zmc9jagluys9ob25nlwtvbmcty2hpbmetyw5uaxzlcnnhcnktegktyxjyaxzlcy1pbnrslwhuay9pbmrlec5odg1s0gfjahr0chm6ly9hbxauy25ulmnvbs9jbm4vmjaymi8wni8zmc9jagluys9ob25nlwtvbmcty2hpbmetyw5uaxzlcnnhcnktegktyxjyaxzlcy1pbnrslwhuay9pbmrlec5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiX2h0dHBzOi8vd3d3LmNubi5jb20vMjAyMi8wNi8zMC9jaGluYS9ob25nLWtvbmctY2hpbmEtYW5uaXZlcnNhcnkteGktYXJyaXZlcy1pbnRsLWhuay9pbmRleC5odG1s0gFjaHR0cHM6Ly9hbXAuY25uLmNvbS9jbm4vMjAyMi8wNi8zMC9jaGluYS9ob25nLWtvbmctY2hpbmEtYW5uaXZlcnNhcnkteGktYXJyaXZlcy1pbnRsLWhuay9pbmRleC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 12, 37, tzinfo=tzutc()), 'title': 'Xi Jinping leaves mainland China for the first time since the beginning of pandemic - CNN', 'normalized_title': 'xi jinping leaves mainland china for the first time since the beginning of pandemic cnn', 'normalized_title_hash': '42d81ea4a3b5f0e3c8bc41d9fca6ede2'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiawh0dhbzoi8vd3d3lmj1c2luzxnzaw5zawrlci5jb20vchv0aw4td2fybnmtzmlubgfuzc1hbmqtc3dlzgvulwfnywluc3qtag9zdgluzy1uyxrvlwluznjhc3rydwn0dxjlltiwmjitntibbwh0dhbzoi8vd3d3lmj1c2luzxnzaw5zawrlci5jb20vchv0aw4td2fybnmtzmlubgfuzc1hbmqtc3dlzgvulwfnywluc3qtag9zdgluzy1uyxrvlwluznjhc3rydwn0dxjlltiwmjitnj9hbxa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAPKgHF9dEMaJ2A5rTTVdZ4qLQgEKiUIACIbd3d3LmJ1c2luZXNzaW5zaWRlci5jb20vc2FpKgQICjAMMJD-CQ', 'published_at': datetime.datetime(2022, 6, 30, 9, 17, 11, tzinfo=tzutc()), 'title': 'Putin warns Finland and Sweden against hosting NATO infrastructure - Business Insider', 'normalized_title': 'putin warns finland and sweden against hosting nato infrastructure business insider', 'normalized_title_hash': '1e42b46fc34cc62c04a23df7e9ca5305'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmicgh0dhbzoi8vd3d3lnjldxrlcnmuy29tl21hcmtldhmvzxvyb3bll3vzlwnvbnn1bwvylxnwzw5kaw5nlxjpc2vzlw1vzgvyyxrlbhktaw5mbgf0aw9ulxb1c2hlcy1oawdozxitmjaymi0wni0zmc_saqa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEEeEdjx1aT7iPtDAUD9R2IcqFAgEKg0IACoGCAowt6AMMLAmMOpn', 'published_at': datetime.datetime(2022, 6, 30, 15, 55, tzinfo=tzutc()), 'title': 'U.S. consumer spending, underlying inflation slow in May - Reuters', 'normalized_title': 'u.s. consumer spending, underlying inflation slow in may reuters', 'normalized_title_hash': '0036ddcbac9b006a77968a25d735428f'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmis2h0dhbzoi8vd3d3lmnubi5jb20vdhjhdmvsl2fydgljbguvywlylxryyxzlbc1jagfvcy1tb3jllxrvlwnvbwuvaw5kzxguahrtbnibr2h0dhbzoi8vd3d3lmnubi5jb20vdhjhdmvsl2ftcc9haxitdhjhdmvslwnoyw9zlw1vcmutdg8ty29tzs9pbmrlec5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiS2h0dHBzOi8vd3d3LmNubi5jb20vdHJhdmVsL2FydGljbGUvYWlyLXRyYXZlbC1jaGFvcy1tb3JlLXRvLWNvbWUvaW5kZXguaHRtbNIBR2h0dHBzOi8vd3d3LmNubi5jb20vdHJhdmVsL2FtcC9haXItdHJhdmVsLWNoYW9zLW1vcmUtdG8tY29tZS9pbmRleC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 13, 28, 30, tzinfo=tzutc()), 'title': 'Why more air travel chaos is on its way - CNN', 'normalized_title': 'why more air travel chaos is on its way cnn', 'normalized_title_hash': '4e29e49aa2cdb02f9d83288104f7755e'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixwh0dhbzoi8vd3d3lmnuymmuy29tlziwmjivmdyvmzavag91c2luzy1zag9ydgfnzs1zdgfydhmtzwfzaw5nlwfzlwxpc3rpbmdzlxn1cmdllwlulwp1bmuuahrtbnibywh0dhbzoi8vd3d3lmnuymmuy29tl2ftcc8ymdiylza2lzmwl2hvdxnpbmctc2hvcnrhz2utc3rhcnrzlwvhc2luzy1hcy1saxn0aw5ncy1zdxjnzs1pbi1qdw5llmh0bww?oc=5', 'domain': 'google.com', 'guid': 'CAIiEH1Clyg8W9nDMoBhtyWu9r4qGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_5ngY', 'published_at': datetime.datetime(2022, 6, 30, 17, 33, 59, tzinfo=tzutc()), 'title': 'Housing shortage starts easing as listings surge in June - CNBC', 'normalized_title': 'housing shortage starts easing as listings surge in june cnbc', 'normalized_title_hash': '35679e74bbc88934fcd3877557d2a083'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmib2h0dhbzoi8vd3d3lmludmvzdgluzy5jb20vbmv3cy9ly29ub215l2z1dhvyzxmtdhvtymxllw9ulwxhc3qtzgf5lw9mlwetdg9ycmlklwzpcnn0agfszi1vbi1ncm93dggtzmvhcnmtmjg0mjyyonibaa?oc=5', 'domain': 'google.com', 'guid': 'CBMib2h0dHBzOi8vd3d3LmludmVzdGluZy5jb20vbmV3cy9lY29ub215L2Z1dHVyZXMtdHVtYmxlLW9uLWxhc3QtZGF5LW9mLWEtdG9ycmlkLWZpcnN0aGFsZi1vbi1ncm93dGgtZmVhcnMtMjg0MjYyONIBAA', 'published_at': datetime.datetime(2022, 6, 30, 15, 11, tzinfo=tzutc()), 'title': 'Wall Street plunges, S&P 500 set for worst first-half since 1970 By Reuters - Investing.com', 'normalized_title': 'wall street plunges, s 500 set for worst first-half since 1970 by reuters investing.com', 'normalized_title_hash': '6b2eb3359a5d3a93e4c73bb9b8f5c99d'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixgh0dhbzoi8vd3d3lnrozxzlcmdllmnvbs8ymdiylzyvmzavmjmxodkzotivc2ftc3vuzy1nyw1pbmctahvilxhib3gtc3rhzglhlwx1bmetyxbwcy1zdxbwb3j00gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiENRivVh9bRiT_nrM15jE3hkqFwgEKg4IACoGCAow3O8nMMqOBjCzr7gD', 'published_at': datetime.datetime(2022, 6, 30, 15, 0, tzinfo=tzutc()), 'title': "Samsung's gaming TV hub launches with Xbox, Stadia, and GeForce Now streaming - The Verge", 'normalized_title': "samsung's gaming tv hub launches with xbox, stadia, and geforce now streaming the verge", 'normalized_title_hash': 'c6f3ef3611746b42221286da995c61ab'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiogh0dhbzoi8vd3d3lmvuz2fkz2v0lmnvbs9izxn0lxntyxj0cghvbmvzlte0mdawndkwmc5odg1s0ge8ahr0chm6ly93d3cuzw5nywrnzxquy29tl2ftcc9izxn0lxntyxj0cghvbmvzlte0mdawndkwmc5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiOGh0dHBzOi8vd3d3LmVuZ2FkZ2V0LmNvbS9iZXN0LXNtYXJ0cGhvbmVzLTE0MDAwNDkwMC5odG1s0gE8aHR0cHM6Ly93d3cuZW5nYWRnZXQuY29tL2FtcC9iZXN0LXNtYXJ0cGhvbmVzLTE0MDAwNDkwMC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 14, 2, 6, tzinfo=tzutc()), 'title': 'The best smartphones you can buy right now - Engadget', 'normalized_title': 'the best smartphones you can buy right now engadget', 'normalized_title_hash': '4c6cee1bd7245520a1e697adf48b5165'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmivwh0dhbzoi8vd3d3lnbvbhlnb24uy29tlzizmtg5odi2l3nryxrlltqtzwfybhktywnjzxnzlwluc2lkzxitchjvz3jhbs1lbgvjdhjvbmljlwfydhpsawjodhrwczovl3d3dy5wb2x5z29ulmnvbs9wbgf0zm9ybs9hbxavmjmxodk4mjyvc2thdgutnc1lyxjses1hy2nlc3mtaw5zawrlci1wcm9ncmftlwvszwn0cm9uawmtyxj0cw?oc=5', 'domain': 'google.com', 'guid': 'CAIiECQ0CgfMksNrdQ_Seh9Ka2QqGAgEKg8IACoHCAow6IDNATDnu3cwhq6EAw', 'published_at': datetime.datetime(2022, 6, 30, 16, 54, 51, tzinfo=tzutc()), 'title': 'Skate insider program to offer early access to playtests of Skate 4 - Polygon', 'normalized_title': 'skate insider program to offer early access to playtests of skate 4 polygon', 'normalized_title_hash': '0a01b47572e305a61c14c375fecc43e6'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiv2h0dhbzoi8vd3d3lnrozxzlcmdllmnvbs8ymdiylzyvmzavmjmxodk0ntavy2hyb21llxbhc3n3b3jklw1hbmfnzxitdxbkyxrlcy1pb3mtyw5kcm9pznibaa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEPmGUVsMy8REX-WNsLnVTZ4qFwgEKg4IACoGCAow3O8nMMqOBjCkztQD', 'published_at': datetime.datetime(2022, 6, 30, 16, 0, tzinfo=tzutc()), 'title': 'Chrome password manager update will let you manually add credentials on all platforms - The Verge', 'normalized_title': 'chrome password manager update will let you manually add credentials on all platforms the verge', 'normalized_title_hash': 'e411a35f668dbc0cb18147ab7741174b'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiy2h0dhbzoi8vcgfnzxnpec5jb20vmjaymi8wni8zmc90cmf2axmtymfya2vycy1kyxvnahrlci1wb3n0cy1uzxctcgljlxdpdggtzgfklwftawqtag9zcgl0ywxpemf0aw9ul9ibz2h0dhbzoi8vcgfnzxnpec5jb20vmjaymi8wni8zmc90cmf2axmtymfya2vycy1kyxvnahrlci1wb3n0cy1uzxctcgljlxdpdggtzgfklwftawqtag9zcgl0ywxpemf0aw9ul2ftcc8?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAJM3ffGcsFENe3a-N5J9WAqGQgEKhAIACoHCAowmID5CjDdtOACMLzWtAU', 'published_at': datetime.datetime(2022, 6, 30, 11, 28, tzinfo=tzutc()), 'title': "Travis Barker's daughter, Alabama, posts new photo with dad amid hospitalization - Page Six", 'normalized_title': "travis barker's daughter, alabama, posts new photo with dad amid hospitalization page six", 'normalized_title_hash': '31528e5395151c83870b7f89d6ed9346'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmicwh0dhbzoi8vd3d3lmxhdgltzxmuy29tl2vudgvydgfpbm1lbnqtyxj0cy9idxnpbmvzcy9zdg9yes8ymdiylta2ltmwl3jhbmrhbgwtzw1tzxr0lwjydwnllxdpbgxpcy1wywnpbm8tbgfsys1rzw500gf7ahr0chm6ly93d3cubgf0aw1lcy5jb20vzw50zxj0ywlubwvudc1hcnrzl2j1c2luzxnzl3n0b3j5lziwmjitmdytmzavcmfuzgfsbc1lbw1ldhqtynj1y2utd2lsbglzlxbhy2luby1sywxhlwtlbnq_x2ftcd10cnvl?oc=5', 'domain': 'google.com', 'guid': 'CBMicWh0dHBzOi8vd3d3LmxhdGltZXMuY29tL2VudGVydGFpbm1lbnQtYXJ0cy9idXNpbmVzcy9zdG9yeS8yMDIyLTA2LTMwL3JhbmRhbGwtZW1tZXR0LWJydWNlLXdpbGxpcy1wYWNpbm8tbGFsYS1rZW500gF7aHR0cHM6Ly93d3cubGF0aW1lcy5jb20vZW50ZXJ0YWlubWVudC1hcnRzL2J1c2luZXNzL3N0b3J5LzIwMjItMDYtMzAvcmFuZGFsbC1lbW1ldHQtYnJ1Y2Utd2lsbGlzLXBhY2luby1sYWxhLWtlbnQ_X2FtcD10cnVl', 'published_at': datetime.datetime(2022, 6, 30, 12, 0, 30, tzinfo=tzutc()), 'title': 'Randall Emmett faces civil fraud claims, abuse allegations - Los Angeles Times', 'normalized_title': 'randall emmett faces civil fraud claims, abuse allegations los angeles times', 'normalized_title_hash': '214a630b8c87efd5d8cd4a4d6d097326'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiagh0dhbzoi8vd3d3lmvvbmxpbmuuy29tl25ld3mvmtmznjm3ms90aw0tywxszw4tz2l2zxmtaglzlwjydxrhbgx5lwhvbmvzdc10ag91z2h0cy1vbi1uzxctbglnahr5zwfylw1vdmll0ggeawh0dhbzoi8vd3d3lmvvbmxpbmuuy29tl2ftcc9uzxdzlzezmzyznzevdgltlwfsbgvulwdpdmvzlwhpcy1icnv0ywxses1ob25lc3qtdghvdwdodhmtb24tbmv3lwxlc3npz3jlyxrlcmxpz2h0ewvhcmxlc3npz3jlyxrlci1tb3zpzq?oc=5', 'domain': 'google.com', 'guid': 'CAIiEO-_Wx41AS_K--Xb2N-l0mMqGQgEKhAIACoHCAowq_7zCjCt4tQCMPa0pwY', 'published_at': datetime.datetime(2022, 6, 30, 14, 34, tzinfo=tzutc()), 'title': 'Tim Allen Gives His Brutally Honest Thoughts on New Lightyear Movie - E! NEWS', 'normalized_title': 'tim allen gives his brutally honest thoughts on new lightyear movie e! news', 'normalized_title_hash': 'f63349b84ed927fa1b88e5c988ace0ef'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmip2h0dhbzoi8vd3d3lmz0lmnvbs9jb250zw50l2zknzjjmmuxlwi1mtytngnlns1izmy0ltkwmdg5mmfiytdjntibaa?oc=5', 'domain': 'google.com', 'guid': 'CBMiP2h0dHBzOi8vd3d3LmZ0LmNvbS9jb250ZW50L2ZkNzJjMmUxLWI1MTYtNGNlNS1iZmY0LTkwMDg5MmFiYTdjNtIBAA', 'published_at': datetime.datetime(2022, 6, 29, 23, 43, 4, tzinfo=tzutc()), 'title': 'R&B singer R Kelly sentenced to 30 years in prison - Financial Times', 'normalized_title': 'r singer r kelly sentenced to 30 years in prison financial times', 'normalized_title_hash': '1d5620424c320cdb8f08f186c6595db3'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmirgh0dhbzoi8vd3d3lmhvb3bzcnvtb3jzlmnvbs8ymdiylza2lziwmjitbmjhlwzyzwutywdlbmn5lxbyaw1lci5odg1s0gea?oc=5', 'domain': 'google.com', 'guid': 'CBMiRGh0dHBzOi8vd3d3Lmhvb3BzcnVtb3JzLmNvbS8yMDIyLzA2LzIwMjItbmJhLWZyZWUtYWdlbmN5LXByaW1lci5odG1s0gEA', 'published_at': datetime.datetime(2022, 6, 30, 17, 13, 26, tzinfo=tzutc()), 'title': '2022 NBA Free Agency Primer - hoopsrumors.com', 'normalized_title': '2022 nba free agency primer hoopsrumors.com', 'normalized_title_hash': '112f4968dd8a62d7344ef8a0e70d7f9a'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmigwfodhrwczovl3d3dy5jynnzcg9ydhmuy29tl2nvbgxlz2utzm9vdgjhbgwvbmv3cy91c2mtdwnsys1sb29raw5nlxrvlwxlyxzllxbhyy0xmi1mb3itymlnlxrlbi1pbi0ymdi0lxrob3vnac1kzwfslw5vdc15zxqtzmluywxpemvkl9ibhwfodhrwczovl3d3dy5jynnzcg9ydhmuy29tl2nvbgxlz2utzm9vdgjhbgwvbmv3cy91c2mtdwnsys1sb29raw5nlxrvlwxlyxzllxbhyy0xmi1mb3itymlnlxrlbi1pbi0ymdi0lxrob3vnac1kzwfslw5vdc15zxqtzmluywxpemvkl2ftcc8?oc=5', 'domain': 'google.com', 'guid': 'CAIiEFQzClcZqr5KA0mkhURT5ckqFggEKg4IACoGCAow5tYTMODEAjCSuwQ', 'published_at': datetime.datetime(2022, 6, 30, 18, 14, tzinfo=tzutc()), 'title': 'USC, UCLA looking to leave Pac-12 for Big Ten in 2024, though deal not yet finalized - CBS Sports', 'normalized_title': 'usc, ucla looking to leave pac-12 for big ten in 2024, though deal not yet finalized cbs sports', 'normalized_title_hash': '2eea93887dd71cc3c066423897b31c6c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmic2h0dhbzoi8vd3d3lmvzcg4uy28udwsvbmjhl3n0b3j5l18vawqvmzqxnza2mjqvy2hhcmxvdhrllwhvcm5ldhmtbwlszxmtynjpzgdlcy1hcnjlc3rlzc1sb3mtyw5nzwxlcy1ldmutznjlzs1hz2vuy3nsayabahr0chm6ly93d3cuzxnwbi5jby51ay9uymevc3rvcnkvxy9pzc8znde3mdyync9jagfybg90dgutag9ybmv0cy1tawxlcy1icmlkz2vzlwfycmvzdgvklwxvcy1hbmdlbgvzlwv2zs1mcmvllwfnzw5jet9wbgf0zm9ybt1hbxa?oc=5', 'domain': 'google.com', 'guid': 'CBMic2h0dHBzOi8vd3d3LmVzcG4uY28udWsvbmJhL3N0b3J5L18vaWQvMzQxNzA2MjQvY2hhcmxvdHRlLWhvcm5ldHMtbWlsZXMtYnJpZGdlcy1hcnJlc3RlZC1sb3MtYW5nZWxlcy1ldmUtZnJlZS1hZ2VuY3nSAYABaHR0cHM6Ly93d3cuZXNwbi5jby51ay9uYmEvc3RvcnkvXy9pZC8zNDE3MDYyNC9jaGFybG90dGUtaG9ybmV0cy1taWxlcy1icmlkZ2VzLWFycmVzdGVkLWxvcy1hbmdlbGVzLWV2ZS1mcmVlLWFnZW5jeT9wbGF0Zm9ybT1hbXA', 'published_at': datetime.datetime(2022, 6, 30, 14, 27, 37, tzinfo=tzutc()), 'title': "Charlotte Hornets' Miles Bridges arrested in Los Angeles on eve of free agency - ESPN.co.uk", 'normalized_title': "charlotte hornets' miles bridges arrested in los angeles on eve of free agency espn.co.uk", 'normalized_title_hash': '883361f3025808e1ccd895bf599d488c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiw2h0dhbzoi8vbnlwb3n0lmnvbs8ymdiylza2lzmwl2rlam91bnrllw11cnjhes10cmfkzs1kcmf3cy10d2l0dgvylxjlywn0aw9ulwzyb20tdhjhzs15b3vuzy_sav9odhrwczovl255cg9zdc5jb20vmjaymi8wni8zmc9kzwpvdw50zs1tdxjyyxktdhjhzgutzhjhd3mtdhdpdhrlci1yzwfjdglvbi1mcm9tlxryywutew91bmcvyw1wlw?oc=5', 'domain': 'google.com', 'guid': 'CAIiECKZzA9BT-0i7_zPCkEUNf8qGAgEKg8IACoHCAowhK-LAjD4ySww-9S0BQ', 'published_at': datetime.datetime(2022, 6, 30, 14, 41, tzinfo=tzutc()), 'title': "Trae Young loving the Dejounte Murray trade: 'S--t just got real' - New York Post", 'normalized_title': "trae young loving the dejounte murray trade 's--t just got real' new york post", 'normalized_title_hash': '211a3e23c60c4e1e2f3c615ae794af62'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiwwh0dhbzoi8vd3d3lmnubi5jb20vmjaymi8wni8zmc9hc2lhl2fuy2llbnqtcgfuzgetymftym9vlxrodw1ilxnpehrolwrpz2l0lxnjbi9pbmrlec5odg1s0gfdahr0chm6ly9hbxauy25ulmnvbs9jbm4vmjaymi8wni8zmc9hc2lhl2fuy2llbnqtcgfuzgetymftym9vlxrodw1ilxnpehrolwrpz2l0lxnjbi9pbmrlec5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiWWh0dHBzOi8vd3d3LmNubi5jb20vMjAyMi8wNi8zMC9hc2lhL2FuY2llbnQtcGFuZGEtYmFtYm9vLXRodW1iLXNpeHRoLWRpZ2l0LXNjbi9pbmRleC5odG1s0gFdaHR0cHM6Ly9hbXAuY25uLmNvbS9jbm4vMjAyMi8wNi8zMC9hc2lhL2FuY2llbnQtcGFuZGEtYmFtYm9vLXRodW1iLXNpeHRoLWRpZ2l0LXNjbi9pbmRleC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 15, 2, tzinfo=tzutc()), 'title': 'Pandas evolved their most perplexing feature at least 6 million years ago - CNN', 'normalized_title': 'pandas evolved their most perplexing feature at least 6 million years ago cnn', 'normalized_title_hash': 'eaab18a9488cad2f893970cd7d0b063c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiwgh0dhbzoi8vd3d3lmnubi5jb20vmjaymi8wni8zmc9jagluys9jagluys10awfud2vultetbwfycy1pbwfnzxmtaw50bc1obmstc2nul2luzgv4lmh0bwzsavxodhrwczovl2ftcc5jbm4uy29tl2nubi8ymdiylza2lzmwl2noaw5hl2noaw5hlxrpyw53zw4tms1tyxjzlwltywdlcy1pbnrslwhuay1zy24vaw5kzxguahrtba?oc=5', 'domain': 'google.com', 'guid': 'CBMiWGh0dHBzOi8vd3d3LmNubi5jb20vMjAyMi8wNi8zMC9jaGluYS9jaGluYS10aWFud2VuLTEtbWFycy1pbWFnZXMtaW50bC1obmstc2NuL2luZGV4Lmh0bWzSAVxodHRwczovL2FtcC5jbm4uY29tL2Nubi8yMDIyLzA2LzMwL2NoaW5hL2NoaW5hLXRpYW53ZW4tMS1tYXJzLWltYWdlcy1pbnRsLWhuay1zY24vaW5kZXguaHRtbA', 'published_at': datetime.datetime(2022, 6, 30, 6, 12, tzinfo=tzutc()), 'title': "China's Mars probe has photographed the entire red planet - CNN", 'normalized_title': "china's mars probe has photographed the entire red planet cnn", 'normalized_title_hash': 'd9d1449bc80642a9c0b9e6b615d21373'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiswh0dhbzoi8vc3bhy2vuzxdzlmnvbs9uyxnhlxbyzxbhcmvzlxrvlxjlbgvhc2utzmlyc3qtandzdc1zy2llbmnllwltywdlcy_saqa?oc=5', 'domain': 'google.com', 'guid': 'CBMiSWh0dHBzOi8vc3BhY2VuZXdzLmNvbS9uYXNhLXByZXBhcmVzLXRvLXJlbGVhc2UtZmlyc3QtandzdC1zY2llbmNlLWltYWdlcy_SAQA', 'published_at': datetime.datetime(2022, 6, 30, 1, 35, 33, tzinfo=tzutc()), 'title': 'NASA prepares to release first JWST science images - SpaceNews', 'normalized_title': 'nasa prepares to release first jwst science images spacenews', 'normalized_title_hash': 'ce62d25761fa4228ac2494fccedc2eb4'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixgh0dhbzoi8vyxjzdgvjag5py2euy29tl3njawvuy2uvmjaymi8wni9uyxnhlwfpbxmtdg8tbgf1bmnolxrozs1zbhmtcm9ja2v0lwlulwp1c3qtmi1tb250ahmv0gfiahr0chm6ly9hcnn0zwnobmljys5jb20vc2npzw5jzs8ymdiylza2l25hc2etywltcy10by1syxvuy2gtdghllxnscy1yb2nrzxqtaw4tanvzdc0ylw1vbnrocy8_yw1wpte?oc=5', 'domain': 'google.com', 'guid': 'CBMiXGh0dHBzOi8vYXJzdGVjaG5pY2EuY29tL3NjaWVuY2UvMjAyMi8wNi9uYXNhLWFpbXMtdG8tbGF1bmNoLXRoZS1zbHMtcm9ja2V0LWluLWp1c3QtMi1tb250aHMv0gFiaHR0cHM6Ly9hcnN0ZWNobmljYS5jb20vc2NpZW5jZS8yMDIyLzA2L25hc2EtYWltcy10by1sYXVuY2gtdGhlLXNscy1yb2NrZXQtaW4tanVzdC0yLW1vbnRocy8_YW1wPTE', 'published_at': datetime.datetime(2022, 6, 28, 21, 39, 55, tzinfo=tzutc()), 'title': 'NASA aims to launch the SLS rocket in just 2 months - Ars Technica', 'normalized_title': 'nasa aims to launch the sls rocket in just 2 months ars technica', 'normalized_title_hash': '9c2ef5bfaa8e503719f5db26fe99e8a3'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiywh0dhbzoi8vd3d3lmzveg5ld3muy29tl2hlywx0ac93ag8td2fybnmtc3vzdgfpbmvklxryyw5zbwlzc2lvbi1tb25rzxlwb3gtcmlza3mtdnvsbmvyywjszs1ncm91chpsawvodhrwczovl3d3dy5mb3huzxdzlmnvbs9ozwfsdggvd2hvlxdhcm5zlxn1c3rhaw5lzc10cmfuc21pc3npb24tbw9ua2v5cg94lxjpc2tzlxz1bg5lcmfibgutz3jvdxbzlmftca?oc=5', 'domain': 'google.com', 'guid': 'CAIiECSYSYHKh192pIVwg88Qzc8qGQgEKhAIACoHCAowwL2ICzCckocDMKOkvwc', 'published_at': datetime.datetime(2022, 6, 30, 16, 6, 56, tzinfo=tzutc()), 'title': "WHO warns 'sustained transmission' of monkeypox risks vulnerable groups - Fox News", 'normalized_title': "who warns 'sustained transmission' of monkeypox risks vulnerable groups fox news", 'normalized_title_hash': '6b4562b7839b54177117fa496b196922'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmidwh0dhbzoi8vc2npdgvjagrhawx5lmnvbs9hbwvyawnhbi1ozwfydc1hc3nvy2lhdglvbi1zbgvlcc1kdxjhdglvbi1pcy1lc3nlbnrpywwty29tcg9uzw50lwzvci1ozwfydc1hbmqtynjhaw4tagvhbhrol9ibaa?oc=5', 'domain': 'google.com', 'guid': 'CBMidWh0dHBzOi8vc2NpdGVjaGRhaWx5LmNvbS9hbWVyaWNhbi1oZWFydC1hc3NvY2lhdGlvbi1zbGVlcC1kdXJhdGlvbi1pcy1lc3NlbnRpYWwtY29tcG9uZW50LWZvci1oZWFydC1hbmQtYnJhaW4taGVhbHRoL9IBAA', 'published_at': datetime.datetime(2022, 6, 30, 3, 3, 39, tzinfo=tzutc()), 'title': 'American Heart Association: Sleep Duration Is Essential Component for Heart and Brain Health - SciTechDaily', 'normalized_title': 'american heart association sleep duration is essential component for heart and brain health scitechdaily', 'normalized_title_hash': 'de8f3be47d136e5b03925e1aa3e8c602'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiu2h0dhbzoi8vd3d3lnlhag9vlmnvbs9lbnrlcnrhaw5tzw50l3utbm93lw9mzmvylxzhy2npbmf0aw9ucy1hz2fpbnn0lte3mtexmjcxny5odg1s0gfbahr0chm6ly93d3cuewfob28uy29tl2ftcgh0bwwvzw50zxj0ywlubwvudc91lw5vdy1vzmzlci12ywnjaw5hdglvbnmtywdhaw5zdc0xnzexmti3mtcuahrtba?oc=5', 'domain': 'google.com', 'guid': 'CBMiU2h0dHBzOi8vd3d3LnlhaG9vLmNvbS9lbnRlcnRhaW5tZW50L3Utbm93LW9mZmVyLXZhY2NpbmF0aW9ucy1hZ2FpbnN0LTE3MTExMjcxNy5odG1s0gFbaHR0cHM6Ly93d3cueWFob28uY29tL2FtcGh0bWwvZW50ZXJ0YWlubWVudC91LW5vdy1vZmZlci12YWNjaW5hdGlvbnMtYWdhaW5zdC0xNzExMTI3MTcuaHRtbA', 'published_at': datetime.datetime(2022, 6, 29, 17, 11, 12, tzinfo=tzutc()), 'title': 'U.S. Will Now Offer Vaccinations Against Monkeypox to Anyone Who May Have Been Exposed to the Virus - Yahoo Entertainment', 'normalized_title': 'u.s. will now offer vaccinations against monkeypox to anyone who may have been exposed to the virus yahoo entertainment', 'normalized_title_hash': 'ae0bf204e53235d8de73271b58912d2c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiawh0dhbzoi8vd3rvcc5jb20vdmlyz2luawevmjaymi8wni9hzgrpdglvbmfslxbyzxn1bwvklwnhc2vzlw9mlw1vbmtlexbvec1pzgvudglmawvklwlulxbhcnrzlw9mlxzpcmdpbmlhl9ibaa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAAv25P0UhIJ_EKXCdhiYkgqGQgEKhAIACoHCAowjaWHCzCr3IUDMLz0nAY', 'published_at': datetime.datetime(2022, 6, 30, 2, 47, 36, tzinfo=tzutc()), 'title': "Additional 'presumed cases' of monkeypox identified in parts of Virginia - WTOP", 'normalized_title': "additional 'presumed cases' of monkeypox identified in parts of virginia wtop", 'normalized_title_hash': 'd6b2aef8811dbcae1345404d2bf1ff36'}
rahulbot commented 1 year ago

That list of domains is super useful. I poked at just two of them to see if that revealed anything useful:

mk.ru

I took a minute to dive into http://www.mk.ru/rss/index.xml as a sample to understand. Some notes:

shawlocal

I did the same with https://www.shawlocal.com/arcio/rss/:

rahulbot commented 1 year ago

Fixing another bug related to title-based deduplication (in d50913633d83c7aa7abf8ac01f553241fd873e67) may help here too. It was over-aggressively ignoring duplicated titles across all media sources (instead of just within one source).
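
To illustrate the difference between the two deduplication scopes, here is a minimal sketch. It is not the code from that commit. It assumes each story is a dict shaped like the ones dumped earlier in this thread, and it uses the `domain` field as a stand-in for "media source" (the real fetcher may key on a source or feed id instead). The function name `dedupe_by_title` and the `.example` domains are hypothetical.

```python
from typing import Iterable


def dedupe_by_title(stories: Iterable[dict], per_source: bool = True) -> list[dict]:
    """Keep the first story seen for each normalized-title hash.

    per_source=True  -> a title is a duplicate only within the same domain.
    per_source=False -> a title is a duplicate across all domains (the
                        over-aggressive behaviour described above).
    """
    seen: set[tuple] = set()
    kept: list[dict] = []
    for story in stories:
        title_hash = story["normalized_title_hash"]
        # Assumption: "domain" approximates the media source.
        key = (story["domain"], title_hash) if per_source else (title_hash,)
        if key in seen:
            continue
        seen.add(key)
        kept.append(story)
    return kept


stories = [
    {"domain": "a.example", "normalized_title_hash": "h1"},
    {"domain": "b.example", "normalized_title_hash": "h1"},  # same headline, other source
    {"domain": "a.example", "normalized_title_hash": "h1"},  # repeat within a.example
]

global_kept = dedupe_by_title(stories, per_source=False)
scoped_kept = dedupe_by_title(stories, per_source=True)
print(len(global_kept), len(scoped_kept))
```

With global scope, a wire story republished by many outlets collapses to a single row, which would lower the daily count; with per-source scope, each outlet keeps its own copy.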

rahulbot commented 1 year ago

Pulled in 603,247 stories with pub dates of 7/6/22 🎉 That's only about 30% lower than the production server - so maybe fixing the 2 bugs noted above was the solution here? 🤞🏽 We should check again in a few days.

rahulbot commented 1 year ago

Still no luck. I rewrote the central fetching task to make it easier to read and to streamline DB handle usage. I'm resetting the last_fetch_failures to 0 and trying again to determine if there are other biz logic bugs that could be causing a lower total ingest volume. If nothing changes, then I think we need to do another dive comparing a day of production ingest against a day of backup ingest.

Stories fetched by day

As you can see below, we have occasional spikes in story fetch volume, but those don't translate into more stories by day, perhaps because they are duplicates.

[Image: JupyterLab chart of stories fetched by day]

Stories by publication day

Here you can see ingest by publication day is fairly steady. The dip in avg volume in late May is due to the enhanced de-duplication added in #5 & #6.

[Image: JupyterLab chart of stories by publication day]
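
The two views above can be reproduced from story records with a simple group-by. This is a rough sketch, not the notebook code behind the charts. It assumes each story carries a `published_at` datetime (as in the dumps earlier in this thread) plus a `fetched_at` datetime, which is a hypothetical field name used here only for illustration.

```python
import datetime as dt
from collections import Counter

UTC = dt.timezone.utc


def count_by_day(stories, field):
    """Count stories per UTC calendar day of the given datetime field."""
    return Counter(
        story[field].astimezone(UTC).date()
        for story in stories
        if story.get(field) is not None
    )


# "fetched_at" is an assumed field; the values below are made up for the example.
stories = [
    {"published_at": dt.datetime(2022, 6, 28, 21, 39, 55, tzinfo=UTC),
     "fetched_at": dt.datetime(2022, 6, 30, 19, 0, tzinfo=UTC)},
    {"published_at": dt.datetime(2022, 6, 30, 1, 35, 33, tzinfo=UTC),
     "fetched_at": dt.datetime(2022, 6, 30, 19, 0, tzinfo=UTC)},
    {"published_at": dt.datetime(2022, 6, 30, 18, 14, tzinfo=UTC),
     "fetched_at": dt.datetime(2022, 6, 30, 19, 5, tzinfo=UTC)},
]

by_fetch = count_by_day(stories, "fetched_at")
by_pub = count_by_day(stories, "published_at")
for day in sorted(by_pub):
    print(day, by_pub[day])
```

A fetch-day spike spreads across several publication days when a feed returns older items, which is one way the two charts can diverge even before duplicates are dropped.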
rahulbot commented 4 months ago

Closing - I think we're well past this.