Closed rahulbot closed 4 months ago
For investigating, feeds are stored in S3 buckets with dates encoded in filenames like this:
Average daily volume over the last week or two:
Added more feeds in 0396e12. This time it is a dump that checked last_successful_download, not last_new_story.
reposted from email:
most of the backup RSS file is a single line!
lines words chars
4175211 5280699 315561091 prod/mc-2022-06-01.rss
16 1443161 98355897 backup/mc-2022-06-01.rss
Looking at files from 2022-06-01. For each feed file I extracted just the text between and tags into a file named "urls":
1043783 1043783 119202388 prod/urls
262124 262124 28563312 backup/urls
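A minimal sketch of this extraction step, assuming the tag names lost from the sentence above were `<link>` and `</link>` (an assumption; the original tag names did not survive). Because most of the backup file is a single line, the sketch scans the whole text with a regular expression instead of working line by line. The function name `extract_urls` and the sample feed are hypothetical.

```python
import re

# Assumption: URLs sit between <link> and </link>; the original tag names
# were lost from the text, so adjust the pattern if they differ.
LINK_RE = re.compile(r"<link>(.*?)</link>", re.DOTALL)


def extract_urls(rss_text):
    """Return the text found between each <link>...</link> pair, in order."""
    return [m.strip() for m in LINK_RE.findall(rss_text)]


# Works the same whether items are on separate lines or all on one line.
sample = (
    "<rss><channel>"
    "<item><link>https://example.com/a</link></item>"
    "<item><link> https://example.com/b </link></item>"
    "</channel></rss>"
)
urls = extract_urls(sample)
```

Note that a pattern like this would also pick up any channel-level `<link>` element, so the counts can include a few non-article URLs.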
Feeding both URL files through "sort -u" to get unique URLs shows that both files contain duplicates:
640483 640483 71922300 prod/urls.uniq
256867 256867 27982221 backup/urls.uniq
The "domains" file keeps only the domain of each URL: any http*:// prefix and the remainder of the URL past the domain were discarded:
1043783 1041765 17769530 prod/domains
262124 261693 3947539 backup/domains
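The domain step could be sketched in Python as follows. This is not the command that produced the files above (that is not shown); `url_to_domain` is a hypothetical helper built on the standard library's `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def url_to_domain(url):
    """Drop the scheme and everything after the host, keeping only the domain."""
    url = url.strip()
    # urlsplit only recognizes the host when a scheme (or "//") is present.
    if "://" not in url:
        url = "//" + url
    # hostname is lowercased and has any port removed; None when no host.
    return urlsplit(url).hostname or ""
```

For example, `url_to_domain("https://www.mk.ru/rss/news/index.xml")` gives `"www.mk.ru"`. Inputs with no recognizable host give an empty string.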
The count of unique domains (sort -u on domains) in each file isn't very different:
16071 16070 312962 prod/domains.uniq
16872 16871 358861 backup/domains.uniq
Feeding each domain file through the pipeline
sort | uniq -c | sort -rn
gives domain frequencies.
Here are the top 20 for each:
$ head -20 prod/domain-counts backup/domain-counts
==> prod/domain-counts <==
20283 news.google.com
18695 www.mk.ru
18251 www.bizjournals.com
8769 dnrtv.org.vn
8433 www.shawlocal.com
6298 www.radionacional.com.ar
5840 aif.ru
5487 www.dailytelegraph.com.au
5119 www.stuff.co.nz
4972 timesofindia.indiatimes.com
4687 globalnews.ca
4431 bnr.bg
4386 www.bignewsnetwork.com
4088 www.svt.se
4083 www.washingtonpost.com
3925 udn.com
3881 blog.goo.ne.jp
3818 www.chinanews.com
3814 www.iol.co.za
3611 www.heraldsun.com.au
==> backup/domain-counts <==
2730 oglecountylife.com
2481 latestnigeriannews.com
1699 amarujala.com
1499 google.com
1205 uol.com.br
1126 bhaskar.com
961 udn.com
956 timesofindia.indiatimes.com
920 irna.ir
838 pantip.com
682 iribnews.ir
674 haberler.com
664 madhyamam.com
629 globenewswire.com
625 tw.news.yahoo.com
595 lanacion.com.ar
590 bnr.bg
587 finanznachrichten.de
583 navbharattimes.indiatimes.com
580 menafn.com
Looking at the top six domains in the production feed, NONE of them appear (at all) in backup/domains.
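The frequency count and the "does it appear at all" check can be sketched in Python. The helper names and the tiny sample lists are hypothetical. The comparison is an exact string match, so it may overstate what is missing if the two files write domains differently (the production listing above has names like www.mk.ru, while the backup listing shows bare names like google.com).

```python
from collections import Counter


def domain_counts(domains):
    """Count occurrences per domain, most frequent first (like uniq -c | sort -rn)."""
    return Counter(d for d in domains if d).most_common()


def missing_from(top, other_domains):
    """Return the domains in `top` that never occur in `other_domains`."""
    seen = set(other_domains)
    return [d for d in top if d not in seen]


# Hypothetical sample data standing in for prod/domains and backup/domains.
prod = ["www.mk.ru", "news.google.com", "www.mk.ru", "udn.com"]
backup = ["udn.com", "google.com", "mk.ru"]

counts = domain_counts(prod)
top_two = [d for d, _ in counts[:2]]
absent = missing_from(top_two, backup)
```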
Yields a redirect to: https://g1.globo.com/df/distrito-federal/noticia/2022/06/01/nova-cnh-veja-como-sera-novo-modelo-da-habilitacao.ghtml
Redirects to: https://g1.globo.com/df/distrito-federal/noticia/2022/06/01/nova-cnh-veja-como-sera-novo-modelo-da-habilitacao.ghtml
Rahul replied:
interesting stuff. Nice digging. So it sounds like a few top level things you’re seeing: 1) our current system pulls in lots more duplicate URLs 2) the list of domains in both are pretty similar 3) there are some domains missing in the backup that are pulling a lot of stories
Is that about right?
Google news RSS links are weird. Perhaps ask Emily and Rebecca what they’ve seen with Google News RSS feeds or URLs and if there is a standard way we deal with them. I wonder what media source that first example was associated with. For instance, we have a Globo source (60427: https://sources.mediacloud.org/#/sources/60427/feeds) with a handful of feeds that look active. Are those feeds in the backups database?
Some more queries, looking at globo:
mediacloud=# select distinct media_id from stories where collect_date > '2022-06-22' and url like '%globo.com%';
media_id
----------
40252
654106
83352
97032
83399
60347
101943
60427
65275
(9 rows)
mediacloud=# select count(1), media_id from stories where collect_date > '2022-06-22' and url like '%globo.com%' group by media_id;
count | media_id
-------+----------
5 | 40252
56 | 654106
75 | 83352
358 | 97032
359 | 83399
5 | 60347
220 | 101943
1443 | 60427
1 | 65275
(9 rows)
60427 (suggested by Rahul) is the largest.
mediacloud=# select * from media where media_id = 60427;
media_id | url | normalized_url | name | full_text_rss | foreign_rss_links | dup_media_id | is_not_dup | content_delay | editor_notes | public_notes | is_monitored
----------+----------------------+----------------------+-------+---------------+-------------------+--------------+------------+---------------+--------------+--------------+--------------
60427 | http://g1.globo.com/ | http://g1.globo.com/ | Globo | f | f | | t | | | | t
(1 row)
mediacloud=# select * from feeds where media_id = 60427;
feeds_id | media_id | name | url | type | active | last_checksum | last_attempted_download_time | last_successful_download_time | last_new_story_time
----------+----------+-------------------------------------------+---------------------------------------------------------------------------------------------+------------+--------+----------------------------------+-------------------------------+-------------------------------+-------------------------------
951383 | 60427 | Spider Feed | http://g1.globo.com/#spiderfeed | syndicated | f | | | |
137181 | 60427 | Globo | http://g1.globo.com/ | web_page | f | | 2019-04-01 00:19:45.549935-04 | 2019-04-01 00:33:11-04 | 2019-04-01 00:19:45.549935-04
113355 | 60427 | G1 | http://g1.globo.com/dynamo/rss2.xml | syndicated | t | 87725197ed4caa7bcc1486c0568be589 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
1606188 | 60427 | G1 – Semana Pop | https://audio.globoradio.globo.com/podcast/feed/539/semana-pop | syndicated | t | 921b766d5751972f07f84547bf45e6da | 2022-06-17 20:27:35.337227-04 | 2022-06-17 20:28:15.087585-04 | 2020-10-31 12:43:31.867907-04
1606191 | 60427 | G1 - Livro Falado | https://audio.globoradio.globo.com/podcast/feed/592/g1-livro-falado | syndicated | t | 8995124c813ac0ff30bef87463b54c4a | 2022-06-17 20:27:35.337227-04 | 2022-06-17 20:28:15.087585-04 | 2019-09-09 22:23:09.784729-04
1606189 | 60427 | G1 – Educação Financeira | https://audio.globoradio.globo.com/podcast/feed/531/educacao-financeira | syndicated | t | 2974511423b7ace55c8bd7881aba8583 | 2022-06-20 04:24:18.2106-04 | 2022-06-20 04:24:22.028323-04 | 2022-06-14 20:59:14.929969-04
1049867 | 60427 | Globo | https://g1.globo.com/ | web_page | t | | 2022-06-17 20:27:35.337227-04 | 2022-06-17 20:28:15.087585-04 | 2022-06-17 20:27:35.337227-04
1606190 | 60427 | G1 ouviu - seu guia de novidades musicais | https://audio.globoradio.globo.com/podcast/feed/537/g1-ouviu-seu-guia-de-novidades-musicais | syndicated | t | b8c109da3ec2cd14d47b8a8192e7043b | 2022-06-21 15:59:30.927874-04 | 2022-06-21 15:59:38.795003-04 | 2022-06-19 01:27:31.091694-04
(8 rows)
Some poking at news.google.com feeds:
mediacloud=# select count(1), media_id from stories where collect_date > '2022-06-22' and url like '%news.google.com%' group by media_id;
count | media_id
-------+----------
653 | 651253
679 | 651280
884 | 651262
460 | 651272
625 | 348022
264 | 449219
382 | 361078
766 | 651261
621 | 651270
729 | 651263
10 | 295679
738 | 651257
751 | 651252
79 | 59102
788 | 651277
663 | 651258
689 | 651251
729 | 651273
177 | 440044
616 | 651265
647 | 651269
688 | 651260
870 | 651255
690 | 651279
779 | 651250
669 | 651271
854 | 651268
943 | 365945
478 | 375820
672 | 651278
603 | 651266
720 | 649508
797 | 59984
799 | 375782
333 | 27642
741 | 651259
788 | 372638
607 | 651267
619 | 651254
770 | 416841
752 | 651281
302 | 390374
565 | 651256
562 | 651264
567 | 651274
202 | 25927
914 | 651276
1094 | 651275
(48 rows)
mediacloud=# select * from media where media_id = 651275
mediacloud-# ;
media_id | url | normalized_url | name | full_text_rss | foreign_rss_links | dup_media_id | is_not_dup | content_delay | editor_notes | public_notes | is_monitored
----------+--------------------------------------+--------------------------------+----------------------+---------------+-------------------+--------------+------------+---------------+--------------+--------------+--------------
651275 | https://news.google.com/news/?ned=kr | http://google.com/news/?ned=kr | Google - South Korea | | t | | | 0 | | | t
(1 row)
mediacloud=# select * from media where url like 'https://news.google.com/news%';
media_id | url | normalized_url | name | full_text_rss | for
eign_rss_links | dup_media_id | is_not_dup | content_delay | editor_notes | public_notes | is_monitored
----------+---------------------------------------------------+---------------------------------------------+----------------------------------------+---------------+----
---------------+--------------+------------+---------------+--------------+--------------+--------------
651251 | https://news.google.com/news/?ned=bg_bg | http://google.com/news/?ned=bg_bg | Google - Bulgaria | | t
| | | 0 | | | f
651258 | https://news.google.com/news/?ned=hi_in | http://google.com/news/?ned=hi_in | Google (Hindi) | | t
| | | 0 | | | f
651259 | https://news.google.com/news/?ned=ml_in | http://google.com/news/?ned=ml_in | Google (Malayalam) | | t
| | | 0 | | | f
651260 | https://news.google.com/news/?ned=ta_in | http://google.com/news/?ned=ta_in | Google (Tamil) | | t
| | | 0 | | | f
651261 | https://news.google.com/news/?ned=te_in | http://google.com/news/?ned=te_in | Google (Telugu) | | t
| | | 0 | | | f
449219 | https://news.google.com/news/rss/?ned=tw&hl=zh-TW | http://google.com/news/rss/?ned=tw&hl=zh-tw | Google News Taiwan | | t
| | | 0 | | | t
651278 | https://news.google.com/news/?ned=ru_ua | http://google.com/news/?ned=ru_ua | Google (Russian) | | t
| | | 0 | | | f
651281 | https://news.google.com/news/?ned=vi_vn | http://google.com/news/?ned=vi_vn | Google - Vietnam | | t
| | | 0 | | | t
651279 | https://news.google.com/news/?ned=uk_ua | http://google.com/news/?ned=uk_ua | Google (Ukrainian) | | t
| | | 0 | | | f
651272 | https://news.google.com/news/?ned=ar_sa | http://google.com/news/?ned=ar_sa | Google - Saudi Arabia | | t
| | | 0 | | | t
651274 | https://news.google.com/news/?ned=sk_sk | http://google.com/news/?ned=sk_sk | Google - Slovakia | | t
| | | 0 | | | t
651263 | https://news.google.com/news/?ned=iw_il | http://google.com/news/?ned=iw_il | Google (Hebrew) | | t
| | | 0 | | | t
416841 | https://news.google.com/news?ned=cn | http://google.com/news?ned=cn | Google News China | | t
| | | 0 | | | t
651250 | https://news.google.com/news/?ned=pt-BR_br | http://google.com/news/?ned=pt-br_br | Google - Brazil | | t
| | | 0 | | | t
651256 | https://news.google.com/news/?ned=hk | http://google.com/news/?ned=hk | Google - Hong Kong | | t
| | | 0 | | | t
649508 | https://news.google.com/news/?ned=bn_bd | http://google.com/news/?ned=bn_bd | Google - Bangladesh | | t
| | | 0 | | | t
651254 | https://news.google.com/news/?ned=ar_eg | http://google.com/news/?ned=ar_eg | Google - Egypt | | t
| | | 0 | | | t
651255 | https://news.google.com/news/?ned=el_gr | http://google.com/news/?ned=el_gr | Google - Greece | | t
| | | 0 | | | t
651253 | https://news.google.com/news/?ned=cs_cz | http://google.com/news/?ned=cs_cz | Google - Czech Republic | | t
| | | 0 | | | t
651269 | https://news.google.com/news/?ned=pt-PT_pt | http://google.com/news/?ned=pt-pt_pt | Google - Portugal | | t
| | | 0 | | | t
651275 | https://news.google.com/news/?ned=kr | http://google.com/news/?ned=kr | Google - South Korea | | t
| | | 0 | | | t
651268 | https://news.google.com/news/?ned=pl_pl | http://google.com/news/?ned=pl_pl | Google - Poland | | t
| | | 0 | | | t
651252 | https://news.google.com/news/?ned=cn | http://google.com/news/?ned=cn | Google - China (National) | | t
| | | 0 | | | t
651265 | https://news.google.com/news/?ned=ar_lb | http://google.com/news/?ned=ar_lb | Google - Lebanon | | t
| | | 0 | | | t
651276 | https://news.google.com/news/?ned=tw | http://google.com/news/?ned=tw | Google - Taiwan | | t
| | | 0 | | | t
651271 | https://news.google.com/news/?ned=ru_ru | http://google.com/news/?ned=ru_ru | Google - Russia | | t
| | | 0 | | | f
651270 | https://news.google.com/news/?ned=ro_ro | http://google.com/news/?ned=ro_ro | Google - Romania | | t
| | | 0 | | | f
651257 | https://news.google.com/news/?ned=hu_hu | http://google.com/news/?ned=hu_hu | Google - Hungary | | t
| | | 0 | | | f
651267 | https://news.google.com/news/?ned=ar_me | http://google.com/news/?ned=ar_me | Google - Near and Middle East Regional | | t
| | | 0 | | | f
651273 | https://news.google.com/news/?ned=sr_rs | http://google.com/news/?ned=sr_rs | Google - Serbia | | t
| | | 0 | | | f
651277 | https://news.google.com/news/?ned=tr_tr | http://google.com/news/?ned=tr_tr | Google - Turkey | | t
| | | 0 | | | f
651264 | https://news.google.com/news/?ned=lv_lv | http://google.com/news/?ned=lv_lv | Google - Latvia | | t
| | | 0 | | | f
651266 | https://news.google.com/news/?ned=lt_lt | http://google.com/news/?ned=lt_lt | Google - Lithuania | | t
| | | 0 | | | t
651280 | https://news.google.com/news/?ned=ar_ae | http://google.com/news/?ned=ar_ae | Google - United Arab Emirates | | t
| | | 0 | | | t
651262 | https://news.google.com/news/?ned=id_id | http://google.com/news/?ned=id_id | Google - Indonesia | | t
| | | 0 | | | t
(35 rows)
mediacloud=# select * from feeds where url like 'https://news.google.com/news%';
feeds_id | media_id | name | url | type
| active | last_checksum | last_attempted_download_time | last_successful_download_time | last_new_story_time
----------+----------+---------------------------------------------------------------+-------------------------------------------------------------------------+----------
--+--------+----------------------------------+-------------------------------+-------------------------------+-------------------------------
858131 | 375820 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | f | 8751f64868a480dc9d5aa18915cccdde | 2018-03-28 12:09:40.16676-04 | 2018-03-28 13:08:17-04 | 2018-03-28 12:09:40.16676-04
635665 | 295679 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | f | 128d5ff62cd3c67dd015488b1e58697d | 2018-03-28 12:09:40.16676-04 | 2018-03-28 13:15:43-04 | 2018-03-28 12:09:40.16676-04
858151 | 651265 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
619144 | 449219 | Google News Taiwan | https://news.google.com/news/rss/?ned=tw&hl=zh-TW/data/rss | syndicate
d | t | 23cc660e1753546e174271379fdcfabe | 2022-06-18 21:27:26.04296-04 | 2022-06-18 21:27:30.30171-04 | 2022-06-08 07:57:54.41743-04
858150 | 651263 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858146 | 651260 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858161 | 651272 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
585815 | 416841 | Top Stories - Google News | https://news.google.com/news/rss/?ned=cn&hl=zh-CN | syndicate
d | t | 9d7cf6e5b5fa3f061dcef5b627a01b7f | 2022-06-21 07:29:21.384175-04 | 2022-06-18 19:57:29.996409-04 | 2022-06-18 16:27:19.877431-04
858145 | 651259 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858168 | 651280 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
mediacloud=# select * from feeds where url like 'https://news.google.com/news%' and active = 't' and last_successful_download_time > '2022-06-01'
mediacloud-# ;
feeds_id | media_id | name | url | type
| active | last_checksum | last_attempted_download_time | last_successful_download_time | last_new_story_time
----------+----------+---------------------------------------------------------------+-------------------------------------------------------------------------+----------
--+--------+----------------------------------+-------------------------------+-------------------------------+-------------------------------
609336 | 440044 | Top Stories - Google News | https://news.google.com/news/rss/?ned=hk&hl=zh-HK | syndicate
d | t | 068bf9cc49c24f49820e57e5b68323d2 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 11:59:55.7679-04 | 2022-06-23 10:29:44.990395-04
619144 | 449219 | Google News Taiwan | https://news.google.com/news/rss/?ned=tw&hl=zh-TW/data/rss | syndicate
d | t | 23cc660e1753546e174271379fdcfabe | 2022-06-18 21:27:26.04296-04 | 2022-06-18 21:27:30.30171-04 | 2022-06-08 07:57:54.41743-04
1096396 | 449219 | More Top Stories - Google News | https://news.google.com/news/rss/?ned=tw&hl=zh-TW&gl=TW/feeds | syndicate
d | t | 543fd42adfbc6d7bc67f4e769a9d7b90 | 2022-06-22 16:58:28.989124-04 | 2022-06-22 16:58:37.524483-04 | 2022-06-20 01:24:13.490346-04
982298 | 375820 | https://news.google.com/news/rss/headlines?ned=fr&gl=FR&hl=fr | https://news.google.com/news/rss/headlines?ned=fr&gl=FR&hl=fr | syndicate
d | t | 876a2d7e46a5a8c5a0006bb19e69537a | 2022-06-18 12:57:14.867751-04 | 2022-06-18 12:57:21.183534-04 | 2022-06-10 20:35:41.291772-04
585815 | 416841 | Top Stories - Google News | https://news.google.com/news/rss/?ned=cn&hl=zh-CN | syndicate
d | t | 9d7cf6e5b5fa3f061dcef5b627a01b7f | 2022-06-21 07:29:21.384175-04 | 2022-06-18 19:57:29.996409-04 | 2022-06-18 16:27:19.877431-04
982299 | 295679 | https://news.google.com/news/rss/?ned=en_ng&gl=NG&hl=en | https://news.google.com/news/rss/?ned=en_ng&gl=NG&hl=en | syndicate
d | t | 6d44b644a5e627c8159808b32c3e0718 | 2022-06-22 00:59:43.200625-04 | 2022-06-19 09:24:02.672628-04 | 2022-06-19 09:23:56.217234-04
844771 | 649508 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858151 | 651265 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858152 | 651264 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858149 | 375782 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858145 | 651259 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858168 | 651280 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858146 | 651260 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858166 | 651277 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858161 | 651272 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858150 | 651263 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858144 | 651257 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858159 | 651270 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
811090 | 348022 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 37c106ba0b42aa012ebe1380f3d3ad36 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858158 | 651269 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | e55bfcc8f0292c22ec64b826066da046 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858155 | 651262 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 2c7dfdfd798c2171e8f9eb3ee028e371 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858134 | 651250 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | b2be40894860c04223e2a27fc3293eca | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
1647533 | 55491 | "hemp" - Google News | https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=hemp&output=rss | syndicate
d | t | 32924a978b84538a3b5e9ab9ab5b111b | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858147 | 651261 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | b2be40894860c04223e2a27fc3293eca | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858135 | 651251 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | b2be40894860c04223e2a27fc3293eca | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858160 | 651271 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 2c7dfdfd798c2171e8f9eb3ee028e371 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858137 | 651252 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | b3ffebc77abac07dbab61ec47ba200d7 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858169 | 651279 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | f9e54f98dfaa4/?ned=us&gl=US&hl=en | syndicate
d | t | f9e54f98dfaa4333b09bfb864d9423be | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
695317 | 59984 | Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 2c7dfdfd798c2171e8f9eb3ee028e371 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858156 | 365945 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858140 | 651254 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 160f4e1da8a34dd8133eaed103b50601 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858154 | 651267 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 4cc7dcbc3cbfa7bcf2082581c63911a8 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858167 | 651278 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858142 | 651256 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858138 | 416841 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 99f9a2390dcd85af53420fdad74fefa3 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858162 | 651273 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 160f4e1da8a34dd8133eaed103b50601 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
858164 | 651275 | More Top Stories - Google News | https://news.google.com/news/rss/ | syndicate
d | t | 160f4e1da8a34dd8133eaed103b50601 | 2022-06-23 13:59:48.436918-04 | 2022-06-23 13:29:56.982717-04 | 2022-06-23 13:29:48.415607-04
(47 rows)
mediacloud=# select distinct media_id from feeds where url like 'https://news.google.com/news%' and active = 't' and last_successful_download_time > '2022-06-01'
;
media_id
----------
55491
59984
295679
348022
361078
365945
372638
375782
375820
416841
440044
449219
649508
651250
651251
651252
651253
651254
651255
651256
651257
651258
651259
651260
651261
651262
651263
651264
651265
651266
651267
651268
651269
651270
651271
651272
651273
651274
651275
651276
651277
651278
651279
651280
651281
(45 rows)
Whoops! The above two news.google.com domains were the same one! Here's another:
This feels like a well-defined subproblem - how to treat aggregators like Google News. I've relayed that question over to the researcher team with some ideas about paths forward. Feel free to chime in on that thread.
Things done:
DB queries on running MC system (for articles collected on June 1) to determine exact feeds responsible for URLs not present in backup feed:
pbudne@postgresql:~/rss$ cat art-downloads-by-feed4.psql
with t1 as(select count(1), d.feeds_id
from downloads_success d
where d.type = 'content' and
d.download_time >= '2022-06-01' and
d.download_time < '2022-06-02'
group by d.feeds_id
order by count desc
)
select SUM(t1.count), f.url -- XXX want to remove http(s)://
from t1, feeds f
where t1.feeds_id = f.feeds_id
group by f.url
order by sum desc;
pbudne@postgresql:~/rss$ head art-downloads-by-feed4.csv
sum,url
25497,http://www.mk.ru/rss/index.xml
19848,http://www.mk.ru/rss/news/index.xml
17870,https://www.mk.ru/rss/news/index.xml
10207,https://news.google.com/news/rss/
8433,https://www.shawlocal.com/arcio/rss/
6254,http://www.radionacional.com.ar/feed/
5953,http://rss.home.uol.com.br/index.xml
3689,http://www.aif.ru/rss/all.php
3554,https://www.svt.se/rss.xml
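The query's XXX comment notes the wish to drop the http(s):// prefix, and the output shows the same mk.ru feed counted separately under both schemes. A minimal post-processing sketch in Python, assuming the CSV layout shown above (`sum,url`); the function name `merge_by_schemeless_url` is hypothetical and this is not part of the original query:

```python
import csv
import io
import re
from collections import Counter

# Matches a leading http:// or https:// (case-insensitive).
SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)


def merge_by_schemeless_url(csv_text):
    """Sum the `sum` column over rows whose URLs differ only by http/https."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[SCHEME_RE.sub("", row["url"])] += int(row["sum"])
    return totals.most_common()


# Three rows copied from the CSV output above.
sample = """sum,url
19848,http://www.mk.ru/rss/news/index.xml
17870,https://www.mk.ru/rss/news/index.xml
10207,https://news.google.com/news/rss/
"""
merged = merge_by_schemeless_url(sample)
```

With these three rows, the two mk.ru entries collapse into a single total of 37718.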
And from there, investigate why articles from those feed URLs (wanted or not) are not being picked up by the backup-rss-fetcher.
Dump of data from a basic fetch & parse using requests and feedparser, plus the data transform code from backup-rss-fetcher (no errors):
https://news.google.com/news/rss/
====== d.X:
bozo False
entries ...
feed ...
headers {}
encoding utf-8
version rss20
namespaces {'media': 'http://search.yahoo.com/mrss/'}
==== d.feed.X:
generator_detail FeedParserDict {'name': 'NFE/5.0'}
generator str 'NFE/5.0'
title str 'Top stories - Google News'
title_detail FeedParserDict {'type': 'text/plain', 'language': None, 'base': '...
links list [{'rel': 'alternate', 'type': 'text/html', 'href':...
link str 'https://news.google.com/?hl=en-US&gl=US&ceid=US:e...
language str 'en-US'
publisher str 'news-webmaster@google.com'
publisher_detail FeedParserDict {'email': 'news-webmaster@google.com'}
rights str '2022 Google Inc.'
rights_detail FeedParserDict {'type': 'text/plain', 'language': None, 'base': '...
updated str 'Thu, 30 Jun 2022 19:05:03 GMT'
updated_parsed struct_time time.struct_time(tm_year=2022, tm_mon=6, tm_mday=3...
subtitle str 'Google News'
subtitle_detail FeedParserDict {'type': 'text/html', 'language': None, 'base': ''...
---
{'url': 'http://google.com/__i/rss/rd/articles/cbmixmh0dhbzoi8vd3d3lndhc2hpbmd0b25wb3n0lmnvbs9jbgltyxrllwvudmlyb25tzw50lziwmjivmdyvmzavzxbhlxn1chjlbwuty291cnqtd2vzdc12axjnaw5pys_saqa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAYsDObicBMWgqrf6J5D06AqGAgEKg8IACoHCAowjtSUCjC30XQw0fe8Bg', 'published_at': datetime.datetime(2022, 6, 30, 18, 52, 39, tzinfo=tzutc()), 'title': 'Supreme Court ruling West Virginia v. EPA chills Biden climate agenda - The Washington Post', 'normalized_title': 'supreme court ruling west virginia v. epa chills biden climate agenda the washington post', 'normalized_title_hash': '969d315ba7bc6ec1f9c961df0b0c7c89'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixmh0dhbzoi8vd3d3lm5wci5vcmcvmjaymi8wni8zmc8xmta4nze0mzq1l2tldgfuamktynjvd24tamfja3nvbi1zdxbyzw1llwnvdxj0lw9hdggtc3dlyxjpbmctaw7saqa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEKE18YUh-MQh6WoHMBn3KB0qFwgEKg4IACoGCAow9vBNMK3UCDClvJYH', 'published_at': datetime.datetime(2022, 6, 30, 18, 3, tzinfo=tzutc()), 'title': 'Ketanji Brown Jackson sworn in as first Black woman on the Supreme Court - NPR', 'normalized_title': 'ketanji brown jackson sworn in as first black woman on the supreme court npr', 'normalized_title_hash': '0846a04dd8a96c598fbbf38623ee7b75'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmidwh0dhbzoi8vd3d3lndzai5jb20vyxj0awnszxmvymlkzw4tc2f5cy1ozs1zdxbwb3j0cy1legnlchrpb24tdg8tzmlsawj1c3rlci10by1jb2rpznktcm9llxytd2fkzs1pbnrvlwxhdy0xmty1nju5njeym9ibewh0dhbzoi8vd3d3lndzai5jb20vyw1wl2fydgljbgvzl2jpzgvulxnhexmtagutc3vwcg9ydhmtzxhjzxb0aw9ulxrvlwzpbglidxn0zxitdg8ty29kawz5lxjvzs12lxdhzgutaw50by1syxctmte2nty1otyxmjm?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAIiu6fmy5UgJmVDVz6HJU0qGAgEKg8IACoHCAow1tzJATDnyxUwxsrPBg', 'published_at': datetime.datetime(2022, 6, 30, 18, 32, tzinfo=tzutc()), 'title': 'Biden Says He Supports Exception to Filibuster to Codify Roe v. Wade Into Law - The Wall Street Journal', 'normalized_title': 'biden says he supports exception to filibuster to codify roe v. wade into law the wall street journal', 'normalized_title_hash': '23b3643177b28bcc32f64adc1a2942ff'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiagh0dhbzoi8vywjjbmv3cy5nby5jb20vvvmvd29tyw4td2fudgvklw11cmrlci1wcm9mzxnzaw9uywwty3ljbglzdc1hcnjlc3rlzc1jb3n0ys1yawnhl3n0b3j5p2lkptg2mda4mjy40gfsahr0chm6ly9hymnuzxdzlmdvlmnvbs9hbxavvvmvd29tyw4td2fudgvklw11cmrlci1wcm9mzxnzaw9uywwty3ljbglzdc1hcnjlc3rlzc1jb3n0ys1yawnhl3n0b3j5p2lkptg2mda4mjy4?oc=5', 'domain': 'google.com', 'guid': 'CBMiaGh0dHBzOi8vYWJjbmV3cy5nby5jb20vVVMvd29tYW4td2FudGVkLW11cmRlci1wcm9mZXNzaW9uYWwtY3ljbGlzdC1hcnJlc3RlZC1jb3N0YS1yaWNhL3N0b3J5P2lkPTg2MDA4MjY40gFsaHR0cHM6Ly9hYmNuZXdzLmdvLmNvbS9hbXAvVVMvd29tYW4td2FudGVkLW11cmRlci1wcm9mZXNzaW9uYWwtY3ljbGlzdC1hcnJlc3RlZC1jb3N0YS1yaWNhL3N0b3J5P2lkPTg2MDA4MjY4', 'published_at': datetime.datetime(2022, 6, 30, 16, 29, 42, tzinfo=tzutc()), 'title': 'Woman wanted in murder of professional cyclist arrested in Costa Rica - ABC News', 'normalized_title': 'woman wanted in murder of professional cyclist arrested in costa rica abc news', 'normalized_title_hash': '0a7d2eb6a26b4f65aa2d1ad1cda1a66c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmidmh0dhbzoi8vd3d3lm5iy25ld3muy29tl3bvbgl0awnzl3n1chjlbwuty291cnqvc3vwcmvtzs1jb3vydc1hbgxvd3mtymlkzw4tzw5klxrydw1wlwvyys1yzw1haw4tbwv4awnvlxbvbgljes1yy25hmzixodfsaspodhrwczovl3d3dy5uymnuzxdzlmnvbs9uzxdzl2ftcc9yy25hmzixodc?oc=5', 'domain': 'google.com', 'guid': 'CAIiEMigMpbAABACG2LHUl_tiD4qGQgEKhAIACoHCAowvIaCCzDnxf4CMP2F8gU', 'published_at': datetime.datetime(2022, 6, 30, 16, 53, 57, tzinfo=tzutc()), 'title': "Supreme Court allows Biden to end Trump-era 'Remain in Mexico' policy - NBC News", 'normalized_title': "supreme court allows biden to end trump-era 'remain in mexico' policy nbc news", 'normalized_title_hash': 'c51e92764483bc1491c0e92a37b50aad'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmir2h0dhbzoi8vd3d3lm55dgltzxmuy29tlziwmjivmdyvmzavdxmvzmxvcmlkys1hym9ydglvbi1iyw4tymxvy2tlzc5odg1s0gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiEJPSW3UijMHKE0QNBtfICmsqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY', 'published_at': datetime.datetime(2022, 6, 30, 18, 45, 21, tzinfo=tzutc()), 'title': 'Florida Judge Will Temporarily Block 15-Week Abortion Ban - The New York Times', 'normalized_title': 'florida judge will temporarily block 15-week abortion ban the new york times', 'normalized_title_hash': '01b7b88fadbe192c424c7cb5d4bb8288'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiswh0dhbzoi8vd3d3lnd2dg0xmy5jb20vyxj0awnszs9kzxb1dgllcy1zag90lwfsywjhbwetymliyi1jb3vudhkvnda0njq1mtbsau1odhrwczovl3d3dy53dnrtmtmuy29tl2ftcc9hcnrpy2xll2rlchv0awvzlxnob3qtywxhymftys1iawjilwnvdw50es80mdq2nduxng?oc=5', 'domain': 'google.com', 'guid': 'CAIiEGGlifvLBkoTVrnXyDK1tQQqMwgEKioIACIQwI8Wot4P9IDiDxcV2kUGOCoUCAoiEMCPFqLeD_SA4g8XFdpFBjgw84yyBw', 'published_at': datetime.datetime(2022, 6, 30, 13, 14, tzinfo=tzutc()), 'title': "2 Bibb County deputies shot, manhunt underway for 'armed and dangerous' suspect - WVTM13 Birmingham", 'normalized_title': "2 bibb county deputies shot, manhunt underway for 'armed and dangerous' suspect wvtm13 birmingham", 'normalized_title_hash': '33f2554f6701dbb5a3b965ef2d80dd40'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmifgh0dhbzoi8vd3d3lmluzgvwzw5kzw50lmnvlnvrl25ld3mvd29ybgqvyw1lcmljyxmvdxmtcg9saxrpy3mvdhj1bxatdg9kyxktd2l0bmvzcy1qyw4tni1jb21taxr0zwutagvhcmluzy1uzxdzlwiymteynje0lmh0bwzsayabahr0chm6ly93d3cuaw5kzxblbmrlbnquy28udwsvbmv3cy93b3jszc9hbwvyawnhcy91cy1wb2xpdgljcy90cnvtcc10b2rhes13axruzxnzlwphbi02lwnvbw1pdhrlzs1ozwfyaw5nlw5ld3mtyjixmti2mtquahrtbd9hbxa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEE0iMMarb_LnGyytLMjBl5sqFggEKg4IACoGCAowzdp7ML-3CTCtyxU', 'published_at': datetime.datetime(2022, 6, 30, 14, 50, 18, tzinfo=tzutc()), 'title': 'Jan 6 hearings – live: Liz Cheney warns of Trump’s ‘domestic threat’ as Melania texts revealed - The Independent', 'normalized_title': 'jan 6 hearings – live liz cheney warns of trump’s ‘domestic threat’ as melania texts revealed the independent', 'normalized_title_hash': 'bcca3d0b0a35bbe57eef92de842803b3'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmilafodhrwczovl3d3dy5uewrhawx5bmv3cy5jb20vbmv3lxlvcmsvbnljlwnyaw1ll255lxdvbwfulwzhdgfsbhktc2hvdc1wdxnoaw5nlxn0cm9sbgvylxvwcgvylwvhc3qtc2lkzs0ymdiymdyzmc1oatzyadnmcxpuywp6cgq1z252ajdochnrys1zdg9yes5odg1s0gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiEJ0NbVI5ObSEaxHO1r_tJ1EqGQgEKhAIACoHCAow1feUCzCqy6oDMPPizQY', 'published_at': datetime.datetime(2022, 6, 30, 14, 37, tzinfo=tzutc()), 'title': "Baby's dad sought for questioning after young mom fatally shot in head pushing stroller on Upper East Side: NYPD sources - New York Daily News", 'normalized_title': "baby's dad sought for questioning after young mom fatally shot in head pushing stroller on upper east side nypd sources new york daily news", 'normalized_title_hash': 'a1f1995d20daff2bdb21bbd89fefc94e'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmizmh0dhbzoi8vd3d3lndhc2hpbmd0b25wb3n0lmnvbs9wb2xpdgljcy8ymdiylza2lzmwl3n1chjlbwuty291cnqtzmvkzxjhbc1lbgvjdglvbnmtc3rhdgutbgvnaxnsyxr1cmvzl9ibaa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEPBIITDMg5_LkY8on7igw1EqGAgEKg8IACoHCAowjtSUCjC30XQwzqe5AQ', 'published_at': datetime.datetime(2022, 6, 30, 17, 48, tzinfo=tzutc()), 'title': "Supreme Court to review state legislatures' power in federal elections - The Washington Post", 'normalized_title': "supreme court to review state legislatures' power in federal elections the washington post", 'normalized_title_hash': '72c644c4d0646db7399591e3bb70656b'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmia2h0dhbzoi8vd3d3lnjldxrlcnmuy29tl3dvcmxkl2v1cm9wzs9ydxnzawetc3rlchmtdxatyxr0ywnrcy11a3jhaw5llwfmdgvylwxhbmrtyxjrlw5hdg8tc3vtbwl0ltiwmjitmdytmzav0gea?oc=5', 'domain': 'google.com', 'guid': 'CBMia2h0dHBzOi8vd3d3LnJldXRlcnMuY29tL3dvcmxkL2V1cm9wZS9ydXNzaWEtc3RlcHMtdXAtYXR0YWNrcy11a3JhaW5lLWFmdGVyLWxhbmRtYXJrLW5hdG8tc3VtbWl0LTIwMjItMDYtMzAv0gEA', 'published_at': datetime.datetime(2022, 6, 30, 17, 49, tzinfo=tzutc()), 'title': 'Russia abandons Snake Island in victory for Ukraine - Reuters.com', 'normalized_title': 'russia abandons snake island in victory for ukraine reuters.com', 'normalized_title_hash': 'ba3b447dab959859e7b122e377fb0237'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmibmh0dhbzoi8vd3d3lmjsb29tymvyzy5jb20vbmv3cy9hcnrpy2xlcy8ymdiylta2ltmwl3vzlxdpbgwtzmfjzs1oawdolwdhcy1wcmljzxmtyxmtbg9uzy1hcy1pdc10ywtlcy1iawrlbi1zyxlz0gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiEOLelQGrCSKPyf59VRvp8i8qGQgEKhAIACoHCAow4uzwCjCF3bsCMIrOrwM', 'published_at': datetime.datetime(2022, 6, 30, 13, 30, 30, tzinfo=tzutc()), 'title': "US Will Face High Gas Prices 'as Long as It Takes,' Biden Says - Bloomberg", 'normalized_title': "us will face high gas prices 'as long as it takes,' biden says bloomberg", 'normalized_title_hash': '108c756d23b0ae22f8f7317b78a48efd'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmix2h0dhbzoi8vd3d3lmnubi5jb20vmjaymi8wni8zmc9jagluys9ob25nlwtvbmcty2hpbmetyw5uaxzlcnnhcnktegktyxjyaxzlcy1pbnrslwhuay9pbmrlec5odg1s0gfjahr0chm6ly9hbxauy25ulmnvbs9jbm4vmjaymi8wni8zmc9jagluys9ob25nlwtvbmcty2hpbmetyw5uaxzlcnnhcnktegktyxjyaxzlcy1pbnrslwhuay9pbmrlec5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiX2h0dHBzOi8vd3d3LmNubi5jb20vMjAyMi8wNi8zMC9jaGluYS9ob25nLWtvbmctY2hpbmEtYW5uaXZlcnNhcnkteGktYXJyaXZlcy1pbnRsLWhuay9pbmRleC5odG1s0gFjaHR0cHM6Ly9hbXAuY25uLmNvbS9jbm4vMjAyMi8wNi8zMC9jaGluYS9ob25nLWtvbmctY2hpbmEtYW5uaXZlcnNhcnkteGktYXJyaXZlcy1pbnRsLWhuay9pbmRleC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 12, 37, tzinfo=tzutc()), 'title': 'Xi Jinping leaves mainland China for the first time since the beginning of pandemic - CNN', 'normalized_title': 'xi jinping leaves mainland china for the first time since the beginning of pandemic cnn', 'normalized_title_hash': '42d81ea4a3b5f0e3c8bc41d9fca6ede2'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiawh0dhbzoi8vd3d3lmj1c2luzxnzaw5zawrlci5jb20vchv0aw4td2fybnmtzmlubgfuzc1hbmqtc3dlzgvulwfnywluc3qtag9zdgluzy1uyxrvlwluznjhc3rydwn0dxjlltiwmjitntibbwh0dhbzoi8vd3d3lmj1c2luzxnzaw5zawrlci5jb20vchv0aw4td2fybnmtzmlubgfuzc1hbmqtc3dlzgvulwfnywluc3qtag9zdgluzy1uyxrvlwluznjhc3rydwn0dxjlltiwmjitnj9hbxa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAPKgHF9dEMaJ2A5rTTVdZ4qLQgEKiUIACIbd3d3LmJ1c2luZXNzaW5zaWRlci5jb20vc2FpKgQICjAMMJD-CQ', 'published_at': datetime.datetime(2022, 6, 30, 9, 17, 11, tzinfo=tzutc()), 'title': 'Putin warns Finland and Sweden against hosting NATO infrastructure - Business Insider', 'normalized_title': 'putin warns finland and sweden against hosting nato infrastructure business insider', 'normalized_title_hash': '1e42b46fc34cc62c04a23df7e9ca5305'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmicgh0dhbzoi8vd3d3lnjldxrlcnmuy29tl21hcmtldhmvzxvyb3bll3vzlwnvbnn1bwvylxnwzw5kaw5nlxjpc2vzlw1vzgvyyxrlbhktaw5mbgf0aw9ulxb1c2hlcy1oawdozxitmjaymi0wni0zmc_saqa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEEeEdjx1aT7iPtDAUD9R2IcqFAgEKg0IACoGCAowt6AMMLAmMOpn', 'published_at': datetime.datetime(2022, 6, 30, 15, 55, tzinfo=tzutc()), 'title': 'U.S. consumer spending, underlying inflation slow in May - Reuters', 'normalized_title': 'u.s. consumer spending, underlying inflation slow in may reuters', 'normalized_title_hash': '0036ddcbac9b006a77968a25d735428f'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmis2h0dhbzoi8vd3d3lmnubi5jb20vdhjhdmvsl2fydgljbguvywlylxryyxzlbc1jagfvcy1tb3jllxrvlwnvbwuvaw5kzxguahrtbnibr2h0dhbzoi8vd3d3lmnubi5jb20vdhjhdmvsl2ftcc9haxitdhjhdmvslwnoyw9zlw1vcmutdg8ty29tzs9pbmrlec5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiS2h0dHBzOi8vd3d3LmNubi5jb20vdHJhdmVsL2FydGljbGUvYWlyLXRyYXZlbC1jaGFvcy1tb3JlLXRvLWNvbWUvaW5kZXguaHRtbNIBR2h0dHBzOi8vd3d3LmNubi5jb20vdHJhdmVsL2FtcC9haXItdHJhdmVsLWNoYW9zLW1vcmUtdG8tY29tZS9pbmRleC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 13, 28, 30, tzinfo=tzutc()), 'title': 'Why more air travel chaos is on its way - CNN', 'normalized_title': 'why more air travel chaos is on its way cnn', 'normalized_title_hash': '4e29e49aa2cdb02f9d83288104f7755e'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixwh0dhbzoi8vd3d3lmnuymmuy29tlziwmjivmdyvmzavag91c2luzy1zag9ydgfnzs1zdgfydhmtzwfzaw5nlwfzlwxpc3rpbmdzlxn1cmdllwlulwp1bmuuahrtbnibywh0dhbzoi8vd3d3lmnuymmuy29tl2ftcc8ymdiylza2lzmwl2hvdxnpbmctc2hvcnrhz2utc3rhcnrzlwvhc2luzy1hcy1saxn0aw5ncy1zdxjnzs1pbi1qdw5llmh0bww?oc=5', 'domain': 'google.com', 'guid': 'CAIiEH1Clyg8W9nDMoBhtyWu9r4qGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_5ngY', 'published_at': datetime.datetime(2022, 6, 30, 17, 33, 59, tzinfo=tzutc()), 'title': 'Housing shortage starts easing as listings surge in June - CNBC', 'normalized_title': 'housing shortage starts easing as listings surge in june cnbc', 'normalized_title_hash': '35679e74bbc88934fcd3877557d2a083'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmib2h0dhbzoi8vd3d3lmludmvzdgluzy5jb20vbmv3cy9ly29ub215l2z1dhvyzxmtdhvtymxllw9ulwxhc3qtzgf5lw9mlwetdg9ycmlklwzpcnn0agfszi1vbi1ncm93dggtzmvhcnmtmjg0mjyyonibaa?oc=5', 'domain': 'google.com', 'guid': 'CBMib2h0dHBzOi8vd3d3LmludmVzdGluZy5jb20vbmV3cy9lY29ub215L2Z1dHVyZXMtdHVtYmxlLW9uLWxhc3QtZGF5LW9mLWEtdG9ycmlkLWZpcnN0aGFsZi1vbi1ncm93dGgtZmVhcnMtMjg0MjYyONIBAA', 'published_at': datetime.datetime(2022, 6, 30, 15, 11, tzinfo=tzutc()), 'title': 'Wall Street plunges, S&P 500 set for worst first-half since 1970 By Reuters - Investing.com', 'normalized_title': 'wall street plunges, s 500 set for worst first-half since 1970 by reuters investing.com', 'normalized_title_hash': '6b2eb3359a5d3a93e4c73bb9b8f5c99d'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixgh0dhbzoi8vd3d3lnrozxzlcmdllmnvbs8ymdiylzyvmzavmjmxodkzotivc2ftc3vuzy1nyw1pbmctahvilxhib3gtc3rhzglhlwx1bmetyxbwcy1zdxbwb3j00gea?oc=5', 'domain': 'google.com', 'guid': 'CAIiENRivVh9bRiT_nrM15jE3hkqFwgEKg4IACoGCAow3O8nMMqOBjCzr7gD', 'published_at': datetime.datetime(2022, 6, 30, 15, 0, tzinfo=tzutc()), 'title': "Samsung's gaming TV hub launches with Xbox, Stadia, and GeForce Now streaming - The Verge", 'normalized_title': "samsung's gaming tv hub launches with xbox, stadia, and geforce now streaming the verge", 'normalized_title_hash': 'c6f3ef3611746b42221286da995c61ab'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiogh0dhbzoi8vd3d3lmvuz2fkz2v0lmnvbs9izxn0lxntyxj0cghvbmvzlte0mdawndkwmc5odg1s0ge8ahr0chm6ly93d3cuzw5nywrnzxquy29tl2ftcc9izxn0lxntyxj0cghvbmvzlte0mdawndkwmc5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiOGh0dHBzOi8vd3d3LmVuZ2FkZ2V0LmNvbS9iZXN0LXNtYXJ0cGhvbmVzLTE0MDAwNDkwMC5odG1s0gE8aHR0cHM6Ly93d3cuZW5nYWRnZXQuY29tL2FtcC9iZXN0LXNtYXJ0cGhvbmVzLTE0MDAwNDkwMC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 14, 2, 6, tzinfo=tzutc()), 'title': 'The best smartphones you can buy right now - Engadget', 'normalized_title': 'the best smartphones you can buy right now engadget', 'normalized_title_hash': '4c6cee1bd7245520a1e697adf48b5165'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmivwh0dhbzoi8vd3d3lnbvbhlnb24uy29tlzizmtg5odi2l3nryxrlltqtzwfybhktywnjzxnzlwluc2lkzxitchjvz3jhbs1lbgvjdhjvbmljlwfydhpsawjodhrwczovl3d3dy5wb2x5z29ulmnvbs9wbgf0zm9ybs9hbxavmjmxodk4mjyvc2thdgutnc1lyxjses1hy2nlc3mtaw5zawrlci1wcm9ncmftlwvszwn0cm9uawmtyxj0cw?oc=5', 'domain': 'google.com', 'guid': 'CAIiECQ0CgfMksNrdQ_Seh9Ka2QqGAgEKg8IACoHCAow6IDNATDnu3cwhq6EAw', 'published_at': datetime.datetime(2022, 6, 30, 16, 54, 51, tzinfo=tzutc()), 'title': 'Skate insider program to offer early access to playtests of Skate 4 - Polygon', 'normalized_title': 'skate insider program to offer early access to playtests of skate 4 polygon', 'normalized_title_hash': '0a01b47572e305a61c14c375fecc43e6'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiv2h0dhbzoi8vd3d3lnrozxzlcmdllmnvbs8ymdiylzyvmzavmjmxodk0ntavy2hyb21llxbhc3n3b3jklw1hbmfnzxitdxbkyxrlcy1pb3mtyw5kcm9pznibaa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEPmGUVsMy8REX-WNsLnVTZ4qFwgEKg4IACoGCAow3O8nMMqOBjCkztQD', 'published_at': datetime.datetime(2022, 6, 30, 16, 0, tzinfo=tzutc()), 'title': 'Chrome password manager update will let you manually add credentials on all platforms - The Verge', 'normalized_title': 'chrome password manager update will let you manually add credentials on all platforms the verge', 'normalized_title_hash': 'e411a35f668dbc0cb18147ab7741174b'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiy2h0dhbzoi8vcgfnzxnpec5jb20vmjaymi8wni8zmc90cmf2axmtymfya2vycy1kyxvnahrlci1wb3n0cy1uzxctcgljlxdpdggtzgfklwftawqtag9zcgl0ywxpemf0aw9ul9ibz2h0dhbzoi8vcgfnzxnpec5jb20vmjaymi8wni8zmc90cmf2axmtymfya2vycy1kyxvnahrlci1wb3n0cy1uzxctcgljlxdpdggtzgfklwftawqtag9zcgl0ywxpemf0aw9ul2ftcc8?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAJM3ffGcsFENe3a-N5J9WAqGQgEKhAIACoHCAowmID5CjDdtOACMLzWtAU', 'published_at': datetime.datetime(2022, 6, 30, 11, 28, tzinfo=tzutc()), 'title': "Travis Barker's daughter, Alabama, posts new photo with dad amid hospitalization - Page Six", 'normalized_title': "travis barker's daughter, alabama, posts new photo with dad amid hospitalization page six", 'normalized_title_hash': '31528e5395151c83870b7f89d6ed9346'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmicwh0dhbzoi8vd3d3lmxhdgltzxmuy29tl2vudgvydgfpbm1lbnqtyxj0cy9idxnpbmvzcy9zdg9yes8ymdiylta2ltmwl3jhbmrhbgwtzw1tzxr0lwjydwnllxdpbgxpcy1wywnpbm8tbgfsys1rzw500gf7ahr0chm6ly93d3cubgf0aw1lcy5jb20vzw50zxj0ywlubwvudc1hcnrzl2j1c2luzxnzl3n0b3j5lziwmjitmdytmzavcmfuzgfsbc1lbw1ldhqtynj1y2utd2lsbglzlxbhy2luby1sywxhlwtlbnq_x2ftcd10cnvl?oc=5', 'domain': 'google.com', 'guid': 'CBMicWh0dHBzOi8vd3d3LmxhdGltZXMuY29tL2VudGVydGFpbm1lbnQtYXJ0cy9idXNpbmVzcy9zdG9yeS8yMDIyLTA2LTMwL3JhbmRhbGwtZW1tZXR0LWJydWNlLXdpbGxpcy1wYWNpbm8tbGFsYS1rZW500gF7aHR0cHM6Ly93d3cubGF0aW1lcy5jb20vZW50ZXJ0YWlubWVudC1hcnRzL2J1c2luZXNzL3N0b3J5LzIwMjItMDYtMzAvcmFuZGFsbC1lbW1ldHQtYnJ1Y2Utd2lsbGlzLXBhY2luby1sYWxhLWtlbnQ_X2FtcD10cnVl', 'published_at': datetime.datetime(2022, 6, 30, 12, 0, 30, tzinfo=tzutc()), 'title': 'Randall Emmett faces civil fraud claims, abuse allegations - Los Angeles Times', 'normalized_title': 'randall emmett faces civil fraud claims, abuse allegations los angeles times', 'normalized_title_hash': '214a630b8c87efd5d8cd4a4d6d097326'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiagh0dhbzoi8vd3d3lmvvbmxpbmuuy29tl25ld3mvmtmznjm3ms90aw0tywxszw4tz2l2zxmtaglzlwjydxrhbgx5lwhvbmvzdc10ag91z2h0cy1vbi1uzxctbglnahr5zwfylw1vdmll0ggeawh0dhbzoi8vd3d3lmvvbmxpbmuuy29tl2ftcc9uzxdzlzezmzyznzevdgltlwfsbgvulwdpdmvzlwhpcy1icnv0ywxses1ob25lc3qtdghvdwdodhmtb24tbmv3lwxlc3npz3jlyxrlcmxpz2h0ewvhcmxlc3npz3jlyxrlci1tb3zpzq?oc=5', 'domain': 'google.com', 'guid': 'CAIiEO-_Wx41AS_K--Xb2N-l0mMqGQgEKhAIACoHCAowq_7zCjCt4tQCMPa0pwY', 'published_at': datetime.datetime(2022, 6, 30, 14, 34, tzinfo=tzutc()), 'title': 'Tim Allen Gives His Brutally Honest Thoughts on New Lightyear Movie - E! NEWS', 'normalized_title': 'tim allen gives his brutally honest thoughts on new lightyear movie e! news', 'normalized_title_hash': 'f63349b84ed927fa1b88e5c988ace0ef'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmip2h0dhbzoi8vd3d3lmz0lmnvbs9jb250zw50l2zknzjjmmuxlwi1mtytngnlns1izmy0ltkwmdg5mmfiytdjntibaa?oc=5', 'domain': 'google.com', 'guid': 'CBMiP2h0dHBzOi8vd3d3LmZ0LmNvbS9jb250ZW50L2ZkNzJjMmUxLWI1MTYtNGNlNS1iZmY0LTkwMDg5MmFiYTdjNtIBAA', 'published_at': datetime.datetime(2022, 6, 29, 23, 43, 4, tzinfo=tzutc()), 'title': 'R&B singer R Kelly sentenced to 30 years in prison - Financial Times', 'normalized_title': 'r singer r kelly sentenced to 30 years in prison financial times', 'normalized_title_hash': '1d5620424c320cdb8f08f186c6595db3'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmirgh0dhbzoi8vd3d3lmhvb3bzcnvtb3jzlmnvbs8ymdiylza2lziwmjitbmjhlwzyzwutywdlbmn5lxbyaw1lci5odg1s0gea?oc=5', 'domain': 'google.com', 'guid': 'CBMiRGh0dHBzOi8vd3d3Lmhvb3BzcnVtb3JzLmNvbS8yMDIyLzA2LzIwMjItbmJhLWZyZWUtYWdlbmN5LXByaW1lci5odG1s0gEA', 'published_at': datetime.datetime(2022, 6, 30, 17, 13, 26, tzinfo=tzutc()), 'title': '2022 NBA Free Agency Primer - hoopsrumors.com', 'normalized_title': '2022 nba free agency primer hoopsrumors.com', 'normalized_title_hash': '112f4968dd8a62d7344ef8a0e70d7f9a'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmigwfodhrwczovl3d3dy5jynnzcg9ydhmuy29tl2nvbgxlz2utzm9vdgjhbgwvbmv3cy91c2mtdwnsys1sb29raw5nlxrvlwxlyxzllxbhyy0xmi1mb3itymlnlxrlbi1pbi0ymdi0lxrob3vnac1kzwfslw5vdc15zxqtzmluywxpemvkl9ibhwfodhrwczovl3d3dy5jynnzcg9ydhmuy29tl2nvbgxlz2utzm9vdgjhbgwvbmv3cy91c2mtdwnsys1sb29raw5nlxrvlwxlyxzllxbhyy0xmi1mb3itymlnlxrlbi1pbi0ymdi0lxrob3vnac1kzwfslw5vdc15zxqtzmluywxpemvkl2ftcc8?oc=5', 'domain': 'google.com', 'guid': 'CAIiEFQzClcZqr5KA0mkhURT5ckqFggEKg4IACoGCAow5tYTMODEAjCSuwQ', 'published_at': datetime.datetime(2022, 6, 30, 18, 14, tzinfo=tzutc()), 'title': 'USC, UCLA looking to leave Pac-12 for Big Ten in 2024, though deal not yet finalized - CBS Sports', 'normalized_title': 'usc, ucla looking to leave pac-12 for big ten in 2024, though deal not yet finalized cbs sports', 'normalized_title_hash': '2eea93887dd71cc3c066423897b31c6c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmic2h0dhbzoi8vd3d3lmvzcg4uy28udwsvbmjhl3n0b3j5l18vawqvmzqxnza2mjqvy2hhcmxvdhrllwhvcm5ldhmtbwlszxmtynjpzgdlcy1hcnjlc3rlzc1sb3mtyw5nzwxlcy1ldmutznjlzs1hz2vuy3nsayabahr0chm6ly93d3cuzxnwbi5jby51ay9uymevc3rvcnkvxy9pzc8znde3mdyync9jagfybg90dgutag9ybmv0cy1tawxlcy1icmlkz2vzlwfycmvzdgvklwxvcy1hbmdlbgvzlwv2zs1mcmvllwfnzw5jet9wbgf0zm9ybt1hbxa?oc=5', 'domain': 'google.com', 'guid': 'CBMic2h0dHBzOi8vd3d3LmVzcG4uY28udWsvbmJhL3N0b3J5L18vaWQvMzQxNzA2MjQvY2hhcmxvdHRlLWhvcm5ldHMtbWlsZXMtYnJpZGdlcy1hcnJlc3RlZC1sb3MtYW5nZWxlcy1ldmUtZnJlZS1hZ2VuY3nSAYABaHR0cHM6Ly93d3cuZXNwbi5jby51ay9uYmEvc3RvcnkvXy9pZC8zNDE3MDYyNC9jaGFybG90dGUtaG9ybmV0cy1taWxlcy1icmlkZ2VzLWFycmVzdGVkLWxvcy1hbmdlbGVzLWV2ZS1mcmVlLWFnZW5jeT9wbGF0Zm9ybT1hbXA', 'published_at': datetime.datetime(2022, 6, 30, 14, 27, 37, tzinfo=tzutc()), 'title': "Charlotte Hornets' Miles Bridges arrested in Los Angeles on eve of free agency - ESPN.co.uk", 'normalized_title': "charlotte hornets' miles bridges arrested in los angeles on eve of free agency espn.co.uk", 'normalized_title_hash': '883361f3025808e1ccd895bf599d488c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiw2h0dhbzoi8vbnlwb3n0lmnvbs8ymdiylza2lzmwl2rlam91bnrllw11cnjhes10cmfkzs1kcmf3cy10d2l0dgvylxjlywn0aw9ulwzyb20tdhjhzs15b3vuzy_sav9odhrwczovl255cg9zdc5jb20vmjaymi8wni8zmc9kzwpvdw50zs1tdxjyyxktdhjhzgutzhjhd3mtdhdpdhrlci1yzwfjdglvbi1mcm9tlxryywutew91bmcvyw1wlw?oc=5', 'domain': 'google.com', 'guid': 'CAIiECKZzA9BT-0i7_zPCkEUNf8qGAgEKg8IACoHCAowhK-LAjD4ySww-9S0BQ', 'published_at': datetime.datetime(2022, 6, 30, 14, 41, tzinfo=tzutc()), 'title': "Trae Young loving the Dejounte Murray trade: 'S--t just got real' - New York Post", 'normalized_title': "trae young loving the dejounte murray trade 's--t just got real' new york post", 'normalized_title_hash': '211a3e23c60c4e1e2f3c615ae794af62'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiwwh0dhbzoi8vd3d3lmnubi5jb20vmjaymi8wni8zmc9hc2lhl2fuy2llbnqtcgfuzgetymftym9vlxrodw1ilxnpehrolwrpz2l0lxnjbi9pbmrlec5odg1s0gfdahr0chm6ly9hbxauy25ulmnvbs9jbm4vmjaymi8wni8zmc9hc2lhl2fuy2llbnqtcgfuzgetymftym9vlxrodw1ilxnpehrolwrpz2l0lxnjbi9pbmrlec5odg1s?oc=5', 'domain': 'google.com', 'guid': 'CBMiWWh0dHBzOi8vd3d3LmNubi5jb20vMjAyMi8wNi8zMC9hc2lhL2FuY2llbnQtcGFuZGEtYmFtYm9vLXRodW1iLXNpeHRoLWRpZ2l0LXNjbi9pbmRleC5odG1s0gFdaHR0cHM6Ly9hbXAuY25uLmNvbS9jbm4vMjAyMi8wNi8zMC9hc2lhL2FuY2llbnQtcGFuZGEtYmFtYm9vLXRodW1iLXNpeHRoLWRpZ2l0LXNjbi9pbmRleC5odG1s', 'published_at': datetime.datetime(2022, 6, 30, 15, 2, tzinfo=tzutc()), 'title': 'Pandas evolved their most perplexing feature at least 6 million years ago - CNN', 'normalized_title': 'pandas evolved their most perplexing feature at least 6 million years ago cnn', 'normalized_title_hash': 'eaab18a9488cad2f893970cd7d0b063c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiwgh0dhbzoi8vd3d3lmnubi5jb20vmjaymi8wni8zmc9jagluys9jagluys10awfud2vultetbwfycy1pbwfnzxmtaw50bc1obmstc2nul2luzgv4lmh0bwzsavxodhrwczovl2ftcc5jbm4uy29tl2nubi8ymdiylza2lzmwl2noaw5hl2noaw5hlxrpyw53zw4tms1tyxjzlwltywdlcy1pbnrslwhuay1zy24vaw5kzxguahrtba?oc=5', 'domain': 'google.com', 'guid': 'CBMiWGh0dHBzOi8vd3d3LmNubi5jb20vMjAyMi8wNi8zMC9jaGluYS9jaGluYS10aWFud2VuLTEtbWFycy1pbWFnZXMtaW50bC1obmstc2NuL2luZGV4Lmh0bWzSAVxodHRwczovL2FtcC5jbm4uY29tL2Nubi8yMDIyLzA2LzMwL2NoaW5hL2NoaW5hLXRpYW53ZW4tMS1tYXJzLWltYWdlcy1pbnRsLWhuay1zY24vaW5kZXguaHRtbA', 'published_at': datetime.datetime(2022, 6, 30, 6, 12, tzinfo=tzutc()), 'title': "China's Mars probe has photographed the entire red planet - CNN", 'normalized_title': "china's mars probe has photographed the entire red planet cnn", 'normalized_title_hash': 'd9d1449bc80642a9c0b9e6b615d21373'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiswh0dhbzoi8vc3bhy2vuzxdzlmnvbs9uyxnhlxbyzxbhcmvzlxrvlxjlbgvhc2utzmlyc3qtandzdc1zy2llbmnllwltywdlcy_saqa?oc=5', 'domain': 'google.com', 'guid': 'CBMiSWh0dHBzOi8vc3BhY2VuZXdzLmNvbS9uYXNhLXByZXBhcmVzLXRvLXJlbGVhc2UtZmlyc3QtandzdC1zY2llbmNlLWltYWdlcy_SAQA', 'published_at': datetime.datetime(2022, 6, 30, 1, 35, 33, tzinfo=tzutc()), 'title': 'NASA prepares to release first JWST science images - SpaceNews', 'normalized_title': 'nasa prepares to release first jwst science images spacenews', 'normalized_title_hash': 'ce62d25761fa4228ac2494fccedc2eb4'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmixgh0dhbzoi8vyxjzdgvjag5py2euy29tl3njawvuy2uvmjaymi8wni9uyxnhlwfpbxmtdg8tbgf1bmnolxrozs1zbhmtcm9ja2v0lwlulwp1c3qtmi1tb250ahmv0gfiahr0chm6ly9hcnn0zwnobmljys5jb20vc2npzw5jzs8ymdiylza2l25hc2etywltcy10by1syxvuy2gtdghllxnscy1yb2nrzxqtaw4tanvzdc0ylw1vbnrocy8_yw1wpte?oc=5', 'domain': 'google.com', 'guid': 'CBMiXGh0dHBzOi8vYXJzdGVjaG5pY2EuY29tL3NjaWVuY2UvMjAyMi8wNi9uYXNhLWFpbXMtdG8tbGF1bmNoLXRoZS1zbHMtcm9ja2V0LWluLWp1c3QtMi1tb250aHMv0gFiaHR0cHM6Ly9hcnN0ZWNobmljYS5jb20vc2NpZW5jZS8yMDIyLzA2L25hc2EtYWltcy10by1sYXVuY2gtdGhlLXNscy1yb2NrZXQtaW4tanVzdC0yLW1vbnRocy8_YW1wPTE', 'published_at': datetime.datetime(2022, 6, 28, 21, 39, 55, tzinfo=tzutc()), 'title': 'NASA aims to launch the SLS rocket in just 2 months - Ars Technica', 'normalized_title': 'nasa aims to launch the sls rocket in just 2 months ars technica', 'normalized_title_hash': '9c2ef5bfaa8e503719f5db26fe99e8a3'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiywh0dhbzoi8vd3d3lmzveg5ld3muy29tl2hlywx0ac93ag8td2fybnmtc3vzdgfpbmvklxryyw5zbwlzc2lvbi1tb25rzxlwb3gtcmlza3mtdnvsbmvyywjszs1ncm91chpsawvodhrwczovl3d3dy5mb3huzxdzlmnvbs9ozwfsdggvd2hvlxdhcm5zlxn1c3rhaw5lzc10cmfuc21pc3npb24tbw9ua2v5cg94lxjpc2tzlxz1bg5lcmfibgutz3jvdxbzlmftca?oc=5', 'domain': 'google.com', 'guid': 'CAIiECSYSYHKh192pIVwg88Qzc8qGQgEKhAIACoHCAowwL2ICzCckocDMKOkvwc', 'published_at': datetime.datetime(2022, 6, 30, 16, 6, 56, tzinfo=tzutc()), 'title': "WHO warns 'sustained transmission' of monkeypox risks vulnerable groups - Fox News", 'normalized_title': "who warns 'sustained transmission' of monkeypox risks vulnerable groups fox news", 'normalized_title_hash': '6b4562b7839b54177117fa496b196922'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmidwh0dhbzoi8vc2npdgvjagrhawx5lmnvbs9hbwvyawnhbi1ozwfydc1hc3nvy2lhdglvbi1zbgvlcc1kdxjhdglvbi1pcy1lc3nlbnrpywwty29tcg9uzw50lwzvci1ozwfydc1hbmqtynjhaw4tagvhbhrol9ibaa?oc=5', 'domain': 'google.com', 'guid': 'CBMidWh0dHBzOi8vc2NpdGVjaGRhaWx5LmNvbS9hbWVyaWNhbi1oZWFydC1hc3NvY2lhdGlvbi1zbGVlcC1kdXJhdGlvbi1pcy1lc3NlbnRpYWwtY29tcG9uZW50LWZvci1oZWFydC1hbmQtYnJhaW4taGVhbHRoL9IBAA', 'published_at': datetime.datetime(2022, 6, 30, 3, 3, 39, tzinfo=tzutc()), 'title': 'American Heart Association: Sleep Duration Is Essential Component for Heart and Brain Health - SciTechDaily', 'normalized_title': 'american heart association sleep duration is essential component for heart and brain health scitechdaily', 'normalized_title_hash': 'de8f3be47d136e5b03925e1aa3e8c602'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiu2h0dhbzoi8vd3d3lnlhag9vlmnvbs9lbnrlcnrhaw5tzw50l3utbm93lw9mzmvylxzhy2npbmf0aw9ucy1hz2fpbnn0lte3mtexmjcxny5odg1s0gfbahr0chm6ly93d3cuewfob28uy29tl2ftcgh0bwwvzw50zxj0ywlubwvudc91lw5vdy1vzmzlci12ywnjaw5hdglvbnmtywdhaw5zdc0xnzexmti3mtcuahrtba?oc=5', 'domain': 'google.com', 'guid': 'CBMiU2h0dHBzOi8vd3d3LnlhaG9vLmNvbS9lbnRlcnRhaW5tZW50L3Utbm93LW9mZmVyLXZhY2NpbmF0aW9ucy1hZ2FpbnN0LTE3MTExMjcxNy5odG1s0gFbaHR0cHM6Ly93d3cueWFob28uY29tL2FtcGh0bWwvZW50ZXJ0YWlubWVudC91LW5vdy1vZmZlci12YWNjaW5hdGlvbnMtYWdhaW5zdC0xNzExMTI3MTcuaHRtbA', 'published_at': datetime.datetime(2022, 6, 29, 17, 11, 12, tzinfo=tzutc()), 'title': 'U.S. Will Now Offer Vaccinations Against Monkeypox to Anyone Who May Have Been Exposed to the Virus - Yahoo Entertainment', 'normalized_title': 'u.s. will now offer vaccinations against monkeypox to anyone who may have been exposed to the virus yahoo entertainment', 'normalized_title_hash': 'ae0bf204e53235d8de73271b58912d2c'}
{'url': 'http://google.com/__i/rss/rd/articles/cbmiawh0dhbzoi8vd3rvcc5jb20vdmlyz2luawevmjaymi8wni9hzgrpdglvbmfslxbyzxn1bwvklwnhc2vzlw9mlw1vbmtlexbvec1pzgvudglmawvklwlulxbhcnrzlw9mlxzpcmdpbmlhl9ibaa?oc=5', 'domain': 'google.com', 'guid': 'CAIiEAAv25P0UhIJ_EKXCdhiYkgqGQgEKhAIACoHCAowjaWHCzCr3IUDMLz0nAY', 'published_at': datetime.datetime(2022, 6, 30, 2, 47, 36, tzinfo=tzutc()), 'title': "Additional 'presumed cases' of monkeypox identified in parts of Virginia - WTOP", 'normalized_title': "additional 'presumed cases' of monkeypox identified in parts of virginia wtop", 'normalized_title_hash': 'd6b2aef8811dbcae1345404d2bf1ff36'}
That list of domains is super useful. I poked at just two of them to see if that revealed anything useful:
I took a minute to dive into http://www.mk.ru/rss/index.xml as a sample to understand. Some notes:
cat data/feeds-2022-06-02.csv | grep www.mk.ru/rss/index.xml > /tmp/mkru-feeds.csv
select * from feeds where mc_feeds_id in (869553,869251,869366,869412,869331,869342,869311,869593,869402,869492,869261,869543,869643,869291,869623,869241,869539,869613,869281,869442,355390,869512,869665,869563,869482,869392,869372,869472,869655,869352,869432,1055385,869452,869583,869502,869603,869522,869422,869633,869301,869321,869382,869271,869462,869578);
I did the same with https://www.shawlocal.com/arcio/rss/:
cat data/feeds-2022-06-02.csv | grep https://www.shawlocal.com/arcio/rss/ > /tmp/shawlocal-feeds.csv
select * from feeds where mc_feeds_id in (2343491,2343499,2338962,2343543,2343531,2338949,2338958,2338951,2338945,2338966,2338968,2343526,2343485,2338954,2342545,2338920,2343514,2343551,2343544,2341476,2343550,2343523,2343500,2343505,2343516,2343492,2338943,2343525,2338959,2338953,2343546,2338932,2343494,2343489,2343497,2343495,2338961,2343513,2340953,2343549,2343487,2343545,2343493,2343529,2338948,2338955,2343530,2338970,2343501,2343488,2343540,2338946,2338929,2343536,2338950,2343502,2338944,2343508,2341311,2343496,2343542,2338956,2343547,2338971,2343524,2343503,2338930,2338935,2338975,2343490,2343541);
select published_at::DATE, count(1) from stories where feed_id in (select id from feeds where mc_feeds_id in (2343491,2343499,2338962,2343543,2343531,2338949,2338958,2338951,2338945,2338966,2338968,2343526,2343485,2338954,2342545,2338920,2343514,2343551,2343544,2341476,2343550,2343523,2343500,2343505,2343516,2343492,2338943,2343525,2338959,2338953,2343546,2338932,2343494,2343489,2343497,2343495,2338961,2343513,2340953,2343549,2343487,2343545,2343493,2343529,2338948,2338955,2343530,2338970,2343501,2343488,2343540,2338946,2338929,2343536,2338950,2343502,2338944,2343508,2341311,2343496,2343542,2338956,2343547,2338971,2343524,2343503,2338930,2338935,2338975,2343490,2343541)) group by 1;
Fixing another bug related to title-based deduplication (in d50913633d83c7aa7abf8ac01f553241fd873e67) may help here too. It was overaggressively ignoring duplicated titles across all media sources (instead of just within one source).
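To make the difference concrete, here is a minimal sketch of title-based deduplication scoped to one source versus applied globally. It is not the code from that commit: the function name `dedupe_by_title`, the `per_source` flag, and the `source_id` field are hypothetical, and only `normalized_title_hash` comes from the story dicts shown above.

```python
def dedupe_by_title(stories, per_source=True):
    """Drop stories whose normalized title hash was already seen.

    `source_id` is an assumed field name for the media source a story
    belongs to; `normalized_title_hash` matches the dumps above.
    """
    seen = set()
    kept = []
    for story in stories:
        title_hash = story["normalized_title_hash"]
        # Scoped key: the same title in two different sources is NOT a duplicate.
        # Global key: the same title anywhere is treated as a duplicate.
        key = (story["source_id"], title_hash) if per_source else title_hash
        if key in seen:
            continue
        seen.add(key)
        kept.append(story)
    return kept
```

With `per_source=False`, a wire story carried by many outlets would be stored once in total, which could plausibly shrink daily volume; with `per_source=True` each outlet keeps its own copy.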
Pulled in 603,247 stories with pub dates of 7/6/22 🎉 That's only about 30% lower than the production server, so maybe fixing those 2 bugs noted above was the solution here? 🤞🏽 We should check again in a few days.
Still no luck. I rewrote the central fetching task to make it easier to read and to streamline DB handle usage. I'm resetting the last_fetch_failures to 0 and trying again to determine if there are other biz logic bugs that could be causing a lower total ingest volume. If nothing changes, then I think we need to do another dive comparing a production day of ingest vs. a backup day of ingest.
As you can see below, we have occasional spikes in story fetch volume, but those don't translate into more stories by day, perhaps because they are duplicates.
Here you can see ingest by publication day is fairly steady. The dip in avg volume in late May is due to the enhanced de-duplication added in #5 & #6.
Closing - I think we're well past this.
How come the backup RSS fetcher only pulls in ~250k stories each day, while the prod system finds almost a million? I've updated the backup one so it has all the RSS feeds that have:
That matches 144,269 feeds in the prod database (as of yesterday). Another difference is that this backup one only stores precisely unique URLs.
Are there a bunch of topics running that are spidering things? Is my query filter to find "actually active" feeds overly limited? Does my logic to check them "regularly" have a bug? Thoughts on other potential causes of this big difference?
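On the point about the backup storing only precisely unique URLs, a rough sketch of what exact-match uniqueness means is below. The function name `keep_exactly_unique` is hypothetical and this is an illustration of the idea, not the fetcher's actual code.

```python
def keep_exactly_unique(urls):
    """Return URLs in first-seen order, dropping only exact string repeats."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Under this rule, URLs that differ by a single character (a trailing slash, a tracking parameter) still count as distinct stories, while a URL repeated verbatim across feeds is stored once. That is the same notion of uniqueness as the `sort -u` comparison of the prod and backup URL files.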