Open philbudne opened 3 months ago
TO BE VERIFIED! The DB Epoch chart always makes me dizzy!
Dates with missing downloads_id
(with applicable epochs in parens):
So, all but nine days in 2021 (previously not known about) are covered by backups of the final PG database (Epoch F)? (Or dumps of the C or E epochs, if we had them)
My gloss after meeting with all of the indexer team was that we want to move ahead with restoring from the Epoch F backup, which should cover the maximal range here (sans those nine days in the middle of November).
I checked the distinct dates for stories from 2020 - 2008 in the restored database B against the csv files we have in s3 under s3://mediacloud-files/${year}. The finding is that for every distinct date where we have a story based on collect_date, there is a corresponding csv file on s3 (_v1 or _v2 prefix).
db_vs_s3_comparison.csv
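For anyone wanting to repeat this kind of check, here is a minimal sketch of the comparison, not the script actually used. It assumes the distinct collect_date values have already been pulled from the database as Python dates, and that each csv object key embeds its date as YYYY-MM-DD or YYYY_MM_DD. That key layout is an assumption; the function names and the example keys are hypothetical.

```python
import re
from datetime import date

# ASSUMPTION: each csv key contains its date as YYYY-MM-DD or YYYY_MM_DD.
DATE_RE = re.compile(r"(\d{4})[-_](\d{2})[-_](\d{2})")

def dates_from_keys(keys):
    """Collect the dates embedded in the names of .csv object keys."""
    found = set()
    for key in keys:
        if not key.endswith(".csv"):
            continue
        m = DATE_RE.search(key)
        if not m:
            continue
        try:
            found.add(date(int(m.group(1)), int(m.group(2)), int(m.group(3))))
        except ValueError:
            # Digits that look like a date but are not a valid one.
            continue
    return found

def dates_without_csv(db_dates, keys):
    """Return the database dates that have no matching csv key, sorted."""
    return sorted(set(db_dates) - dates_from_keys(keys))

if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    db_dates = [date(2019, 3, 1), date(2019, 3, 2), date(2019, 3, 3)]
    keys = ["2019/stories_2019-03-01_v1.csv", "2019/stories_2019-03-02_v2.csv"]
    print(dates_without_csv(db_dates, keys))
```

An empty result corresponds to the finding above: every date with a story has a csv file.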
@thepsalmist -- excellent!!! Thank you!!!
Fernando approved retrieving the archived database to retrieve downloads_ids for Nov/Dec 2021. @rahulbot asked:
I wrote a script to enumerate objects in the mediacloud-database... buckets, requesting only the first 128 bytes of each object and saving only the first line to a disk file. Here is what I found, looking for first lines which lack downloads_id, and ignoring summary files (leaving analysis/discussion to follow-ups):
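For reference, before the results: a rough sketch of how a scan like that can be done, not the actual script. It assumes a boto3 S3 client is passed in as `s3`, uses an HTTP range request for bytes 0-127, and treats any key containing "summary" as a summary file. That rule and all function names are hypothetical, and the bucket name is left as a parameter.

```python
def first_line(head: bytes) -> str:
    """Return the first line of a byte prefix, decoded leniently."""
    line = head.split(b"\n", 1)[0]
    # A 128-byte cut can split a multi-byte character, so replace bad bytes.
    return line.decode("utf-8", errors="replace").rstrip("\r")

def scan_bucket(s3, bucket, is_summary=lambda key: "summary" in key):
    """Yield (key, first_line) for each non-empty, non-summary object.

    `s3` is expected to behave like boto3.client("s3").
    ASSUMPTION: the default `is_summary` rule is a placeholder.
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip empty objects: a byte range cannot be satisfied for them.
            if obj.get("Size", 0) == 0 or is_summary(key):
                continue
            resp = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-127")
            yield key, first_line(resp["Body"].read())

def keys_lacking_downloads_id(rows):
    """Return keys whose first line does not mention downloads_id."""
    return [key for key, line in rows if "downloads_id" not in line]
```

One caveat with this approach: a header line longer than 128 bytes is truncated, so a downloads_id column that appears late in a long header would be reported as missing.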