mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

regenerate "historical" CSV files that don't have downloads_id #329

Open philbudne opened 3 months ago

philbudne commented 3 months ago

Fernando approved retrieving the archived database so that we can recover downloads_ids for Nov/Dec 2021. @rahulbot asked:

Is there a quick way to audit all the other historical CSV files to see if we need download_ids from any other periods as well?

I wrote a script to enumerate objects in the mediacloud-database... buckets, requesting only the first 128 bytes of each object, and saving only the first line to a disk file.
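
For readers who want to repeat the audit, here is a minimal sketch of that kind of script, not the one actually used. It assumes a boto3-style S3 client is passed in; the function names (first_lines, save_first_lines) and the output layout (one small file per object under a directory named after the bucket) are hypothetical.

```python
from pathlib import Path
from typing import Any, Iterator, Tuple


def first_lines(client: Any, bucket: str, nbytes: int = 128) -> Iterator[Tuple[str, str]]:
    """Yield (key, first_line) for each object, fetching only the first nbytes."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if obj.get("Size", 1) == 0:
                # Skip the request: a ranged GET on an empty object is
                # typically rejected by S3.
                yield key, ""
                continue
            resp = client.get_object(
                Bucket=bucket, Key=key, Range=f"bytes=0-{nbytes - 1}"
            )
            head = resp["Body"].read()
            # Keep only the text before the first newline.
            line = head.split(b"\n", 1)[0].rstrip(b"\r")
            yield key, line.decode("utf-8", errors="replace")


def save_first_lines(client: Any, bucket: str, out_dir: str) -> int:
    """Write each object's first line to out_dir/bucket/key; return the file count."""
    count = 0
    for key, line in first_lines(client, bucket):
        if key.endswith("/"):
            continue  # "folder" placeholder object, nothing to save
        path = Path(out_dir) / bucket / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(line + "\n", encoding="utf-8")
        count += 1
    return count
```

With boto3 installed and credentials configured, this could be run as save_first_lines(boto3.client("s3"), bucket_name, ".") for each bucket, after which the find/grep pipeline below works on the saved files.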

Here is what I found, looking for first lines which lack downloads_id, and ignoring summary files (leaving analysis/discussion to follow-ups):

pbudne@angwin:~/s3-audit$ find mediacloud-database-* -name \*csv | xargs grep -v downloads_id | sort | egrep -v 'summar(y_|ies)|database_b.csv'
mediacloud-database-c-files/csv_files/2021_11_12.csv:https://expert.ru/doc-list/rss/
mediacloud-database-c-files/csv_files/2021_11_13.csv:https://www.dharitri.com/feed/
mediacloud-database-c-files/csv_files/2021_11_14.csv:https://www.casilinanews.it/feed
mediacloud-database-c-files/csv_files/2021_11_15.csv:https://www.mercurynews.com/feed/
mediacloud-database-c-files/csv_files/2021_11_16.csv:https://www.monacomatin.mc/rss
mediacloud-database-c-files/csv_files/2021_11_17.csv:https://www.diariodebatepregon.com/rss/home.xml
mediacloud-database-c-files/csv_files/2021_11_18.csv:http://avenueskhabar.com.np/feed/
mediacloud-database-c-files/csv_files/2021_11_20.csv:http://cms-delivery-mia.terra.com/feeder/public/articles/20e07ef2795b2310VgnVCM3000009af154d0RCRD.rss
mediacloud-database-c-files/csv_files/2021_11_21.csv:https://www.ikz-online.de/?
mediacloud-database-files/2013/stories_2013-03-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-05-04.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-05-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-08-06.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2013-09-02.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2013/stories_2017-12-01.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2014/stories_2014-10-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2017/stories_2017-09-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2017/stories_2017-09-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2017/stories_2017-12-01.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-10.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-11.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-12.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-14.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-15.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-16.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-17.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-18.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-19.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-20.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-25.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-26.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-27.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-29.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-30.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-01-31.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-10.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-11.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-12.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-14.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-15.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-16.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-17.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-18.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-19.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-20.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-25.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-26.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-27.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-29.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-30.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-6.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-7.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-8.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2019/stories_2019-04-9.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-25.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-26.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-27.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-28.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-29.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-11-30.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-01.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-02.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-03.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-04.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-05.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-06.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-07.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-08.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-09.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-10.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-11.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-12.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-13.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-14.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-15.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-16.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-17.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-18.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-19.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-20.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-21.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-22.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-23.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-24.csv:collect_date,stories_id,media_id,url
mediacloud-database-files/2021/stories_2021-12-25.csv:collect_date,stories_id,media_id,url
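
A listing this long is easier to reason about as date ranges. The following is a rough sketch, assuming the grep output above is fed in one line at a time; the helper names are hypothetical, and the date is taken from the file name (not the directory), so both the 2021_11_12.csv and stories_2019-04-6.csv styles are handled.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List, Tuple

# Matches the date in names like stories_2019-04-6.csv or 2021_11_12.csv
DATE_RE = re.compile(r"(\d{4})[-_](\d{1,2})[-_](\d{1,2})\.csv")


def dates_from_paths(lines: Iterable[str]) -> List[date]:
    """Extract the distinct file-name dates from 'path:first_line' grep output."""
    found = set()
    for line in lines:
        m = DATE_RE.search(line)
        if not m:
            continue
        year, month, day = map(int, m.groups())
        try:
            found.add(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date; ignore
    return sorted(found)


def contiguous_ranges(days: List[date]) -> List[Tuple[date, date]]:
    """Collapse a sorted list of dates into inclusive (first, last) runs."""
    ranges: List[Tuple[date, date]] = []
    for day in days:
        if ranges and day - ranges[-1][1] == timedelta(days=1):
            ranges[-1] = (ranges[-1][0], day)  # extend the current run
        else:
            ranges.append((day, day))  # start a new run
    return ranges
```

Calling contiguous_ranges(dates_from_paths(open("audit.txt"))) on a saved copy of the output (audit.txt is a made-up name) gives one tuple per run of consecutive days.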

philbudne commented 3 months ago

TO BE VERIFIED! The DB Epoch chart always makes me dizzy!

Dates with missing downloads_id (with applicable epochs in parens):

So, all but nine days in 2021 (previously not known about) are covered by backups of the final PG database (Epoch F)? (Or dumps of the C or E epochs, if we had them)

pgulley commented 3 months ago

My gloss after meeting with all of the indexer team was that we want to move ahead with restoring from the Epoch F backup, which should cover the maximal range here (sans those nine days in the middle of November).

thepsalmist commented 1 month ago

I checked the distinct dates for stories from 2008 through 2020 in the restored database B against the CSV files we have in S3 under s3://mediacloud-files/${year}. The finding is that for every distinct date on which we have a story (based on collect_date), there is a corresponding CSV file in S3 (with a _v1 or _v2 prefix). See db_vs_s3_comparison.csv
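
For anyone repeating this check, the comparison boils down to a set difference. This is only a sketch under assumptions, not the code that produced the attached file: it assumes the distinct collect_date values have already been queried from the database and the S3 keys already listed, and that each CSV key contains its date as YYYY-MM-DD somewhere in the name. The function names are hypothetical.

```python
import re
from datetime import date
from typing import Iterable, List, Set

# Assumption: the date appears as YYYY-MM-DD (or YYYY-M-D) in the key.
KEY_DATE_RE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def dates_in_keys(keys: Iterable[str]) -> Set[date]:
    """Collect the dates named by the .csv keys, ignoring other objects."""
    found: Set[date] = set()
    for key in keys:
        if not key.endswith(".csv"):
            continue
        m = KEY_DATE_RE.search(key.rsplit("/", 1)[-1])  # file name only
        if not m:
            continue
        year, month, day = map(int, m.groups())
        try:
            found.add(date(year, month, day))
        except ValueError:
            pass  # digits that do not form a real date
    return found


def missing_csv_dates(db_dates: Iterable[date], keys: Iterable[str]) -> List[date]:
    """Return the database dates that have no matching CSV key, sorted."""
    return sorted(set(db_dates) - dates_in_keys(keys))
```

An empty result from missing_csv_dates corresponds to the finding reported here: every date with a story has a CSV file.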

philbudne commented 1 month ago

@thepsalmist -- excellent!!! Thank you!!!