rahulbot opened this issue 5 months ago
My recollection is that there are no holes in the 2021 record, HOWEVER:
There's the "overlap" period, where two instances of the system used the same range(s) of database ids.
There is code in hist-fetcher.py to handle this (with copious comments), but I HAVE NOT TESTED IT. My recollection is that it looks at the date in the CSV file, determines which database epoch that corresponds to (B or D), then looks at all versions of the S3 object for that downloads_id and picks the one that was written in the same epoch/time period, WITHOUT checking that the dates are otherwise close/sane.
In other words, it requires some examination before letting it rip.
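For reference, here is a minimal sketch of the version-selection idea described above. This is NOT the actual hist-fetcher.py code; the epoch windows and the assumption that the object key is the decimal downloads_id are illustrative only:

```python
# Sketch only: pick the S3 object version for a downloads_id that was
# written during the same database "epoch" as the CSV row.
# Epoch windows and key format below are assumptions for illustration.
from datetime import datetime, timezone

import boto3

DOWNLOADS_BUCKET = "mediacloud-downloads-backup"

# assumed epoch windows (see the D/B overlap note later in this issue)
EPOCHS = {
    "B": (datetime(2021, 9, 15, tzinfo=timezone.utc),
          datetime(2021, 11, 11, 23, 59, 59, tzinfo=timezone.utc)),
    "D": (datetime(2021, 12, 26, tzinfo=timezone.utc),
          datetime(2022, 1, 25, 23, 59, 59, tzinfo=timezone.utc)),
}

def pick_version(s3, downloads_id: str, epoch: str):
    """Return the VersionId whose LastModified falls inside the epoch window,
    WITHOUT checking that the dates are otherwise close/sane (as noted above)."""
    start, end = EPOCHS[epoch]
    resp = s3.list_object_versions(Bucket=DOWNLOADS_BUCKET, Prefix=downloads_id)
    for v in resp.get("Versions", []):
        if v["Key"] == downloads_id and start <= v["LastModified"] <= end:
            return v["VersionId"]
    return None  # no version written during that epoch

# usage sketch:
# s3 = boto3.client("s3")
# version_id = pick_version(s3, "3360714572", "D")
```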
There are 589 daily CSV files under s3://mediacloud-database-files/2021/
It looks like there are (up to?) three versions of each day for dates between 2021-01-01 and 2021-05-15.
Except for the dates where all three files are trivial (59 bytes, presumably just a column-header line), the three files seem to have different sizes. Here is a snippet of `aws s3 ls` output:
```
2022-12-28 11:43:52   16234828 stories_2021-04-05.csv
2022-12-28 11:43:53   80451582 stories_2021-04-05_v2.csv
2022-12-28 11:43:53   72063717 stories_2021-04-05_v3.csv
2022-12-28 11:43:54   16485868 stories_2021-04-06.csv
2022-12-28 11:43:56   96556995 stories_2021-04-06_v2.csv
2022-12-28 11:43:57   78333686 stories_2021-04-06_v3.csv
2022-12-28 11:43:57   16470124 stories_2021-04-07.csv
2022-12-28 11:43:59   97247724 stories_2021-04-07_v2.csv
2022-12-28 11:43:59   77504728 stories_2021-04-07_v3.csv
2022-12-28 11:43:59   16496389 stories_2021-04-08.csv
2022-12-28 11:44:02   79591431 stories_2021-04-08_v2.csv
2022-12-28 11:44:04         59 stories_2021-04-08_v3.csv
2022-11-23 01:14:05         59 stories_2021-04-09.csv
2022-12-28 11:44:05         59 stories_2021-04-09_v2.csv
2022-12-28 11:44:05         59 stories_2021-04-09_v3.csv
2022-11-23 01:14:05         59 stories_2021-04-10.csv
2022-12-28 11:44:05         59 stories_2021-04-10_v2.csv
2022-12-28 11:44:05         59 stories_2021-04-10_v3.csv
2022-11-23 01:14:06         59 stories_2021-04-11.csv
2022-12-28 11:44:05         59 stories_2021-04-11_v2.csv
2022-12-28 11:44:06         59 stories_2021-04-11_v3.csv
2022-11-23 01:14:06         59 stories_2021-04-12.csv
2022-12-28 11:44:06         59 stories_2021-04-12_v2.csv
2022-12-28 11:44:06         59 stories_2021-04-12_v3.csv
2022-12-28 11:44:06   16910341 stories_2021-04-13.csv
2022-12-28 11:44:06         59 stories_2021-04-13_v2.csv
2022-12-28 11:44:06  154798430 stories_2021-04-13_v3.csv
2022-12-28 11:44:06   17830103 stories_2021-04-14.csv
2022-12-28 11:44:06  117418301 stories_2021-04-14_v2.csv
2022-12-28 11:44:08   76491043 stories_2021-04-14_v3.csv
```
The "v2" file seems to be the largest in MOST cases, but see 4-13 above as an exception.
If we process more than one file for each date, it seems possible/likely that we could download each HTML file as many as three times.
hist-queuer.py avoids downloading multiple S3 objects that are for the same remote URL (the old system downloaded a story each time it appeared in a different feed), but it can only do this within a single input CSV file, not across files.
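As a rough illustration of that per-file deduplication (this is not the actual hist-queuer.py code; the `url` column name is taken from the CSV headers shown further down):

```python
# Sketch of per-file URL deduplication: skip rows whose URL was already
# seen earlier in the SAME CSV file (cannot deduplicate across files).
import csv

def rows_to_queue(csv_path: str):
    seen_urls = set()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row["url"]
            if url in seen_urls:
                continue  # same story fetched via another feed; skip it
            seen_urls.add(url)
            yield row
```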
Does anyone remember how the different versions came about?
> Does anyone remember how the different versions came about?
The versions came from batching the CSV export (e.g. 00:00-12:00 and 12:00-23:59) to avoid Postgres query timeouts. A script to combine the CSVs into a single version should work.
> A script to combine the CSVs into a single version should work.
Not strictly necessary: the queuer doesn't check file suffixes (.csv).
The only advantage would be eliminating duplicates (the legacy system downloaded a story multiple times if it appeared in multiple RSS feeds).
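If we did want to combine a day's versions into one deduplicated file, something along these lines would do it (a sketch only, assuming the files for a given day all have identical headers):

```python
# Sketch: merge stories_YYYY-MM-DD.csv, _v2 and _v3 into a single file,
# keeping the header once and dropping rows duplicated across versions.
import csv
import glob

def combine_day(pattern: str, out_path: str) -> None:
    seen = set()
    writer = None
    with open(out_path, "w", newline="") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                for row in reader:
                    key = tuple(row)
                    if key not in seen:
                        seen.add(key)
                        writer.writerow(row)

# combine_day("stories_2021-01-01*.csv", "combined_2021-01-01.csv")
```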
The hist- stack processing the "Database D" CSV files (for 2022 and 2021) has completed 2021/12/31 back to 2021/12/27, BUT hist-fetcher was unhappy with 2021/12/26 (it looks like it tossed everything from that day into quarantine).
The queuer processes files in reverse lexicographic (character set) order. This is my analysis of the chunks to process, in order (from top to bottom this time), to cover the year:
status | bucket | prefix | object name format | start | end | notes |
---|---|---|---|---|---|---|
done | mediacloud-database-d-files | | stories_2021_mm_dd | 2021/12/26 | 2021/12/31 | *️ |
see below | mediacloud-database-files | /2021/stories_2021-1 | stories_2021-mm-dd | 2021/10/14 | 2021/12/25 | † *️ |
see below | mediacloud-database-c-files | | 2021_mm_dd.csv | 2021/11/12 | 2021/11/21 | |
see below | mediacloud-database-b-files | /stories_2021- | stories_2021-mm-dd | 2021/10/14 | 2021/11/11 | |
see below | mediacloud-database-b-files | /stories2021 | stories_2021_mm_dd | 2021/09/15 | 2021/10/13 | *️ |
done | mediacloud-database-files | /2021/stories_2021-0 | stories_2021-mm-dd | 2021/01/01 | 2021/09/14 | † |
*️NOTE: DB D/B overlap periods are 2021-09-15 thru 2021-11-11 (DB B) and 2021-12-26 thru 2022-01-25 (DB D)
† empty files for 1/31, 4/8 thru 4/12, missing 9/15 thru 10/13 (??)
see https://github.com/mediacloud/story-indexer/issues/329 for other ranges that need downloads_ids
status | start | end | notes |
---|---|---|---|
X | 2021-11-12 | 2021-12-25 | need downloads_ids (from DB F) |
X | 2021-11-21 | 2021-11-12 | needs downloads_ids (from DB C) |
running | 2021-11-11 | 2021-09-14 | available in mc-db-b-files AND mc-db-files? *️ |
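To make the "reverse lexicographic" ordering mentioned above concrete, this is roughly how the keys under a prefix could be listed and ordered (a sketch, not the queuer's actual code):

```python
# Sketch: list keys under a prefix and order them the way the queuer
# would process them (reverse lexicographic, i.e. latest dates first
# for names like stories_2021-mm-dd.csv).
import boto3

def keys_in_processing_order(bucket: str, prefix: str) -> list:
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return sorted(keys, reverse=True)

# keys_in_processing_order("mediacloud-database-files", "2021/stories_2021-1")
```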
I tested 10/1/2021 (epoch B) in my dev stack, pulled main from upstream, merged main to staging, and launched a staging stack on bernstein:
```
./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-files/2021/stories_2021-1
```
I also removed the production hist-indexer stack.
(These are my normal steps when deploying ANY stack.) After a few minutes:
In this case, Grafana showed fetcher activity but no parser activity: the hist-fetcher had reported all stories as "bad-dlid".
The errors look like:
```
2024-08-09 19:55:08,237 82f8da64c4d1 hist-fetcher INFO: bad-dlid: EMPTY
2024-08-09 19:55:08,238 82f8da64c4d1 hist-fetcher INFO: quarantine: QuarantineException('bad-dlid')
```
I downloaded the csv file:
```
aws s3 cp s3://mediacloud-database-files/2021/stories_2021-12-25.csv .
```
and I don't see a downloads_id:
```
collect_date,stories_id,media_id,url
2021-12-25 08:18:56.292171,2147483646,272136,https://observador.pt/2021/12/25/covid-19-coordenador-cientifico-italiano-considera-reforco-da-vacina-crucial-contra-omicron/
2021-12-25 08:18:56.291037,2147483645,375830,https://www.sudouest.fr/gironde/gujan-mestras/bassin-d-arcachon-une-cabane-en-feu-a-gujan-mestras-7452333.php
```
Same for the 24th:
```
pbudne@ifill:~$ head -3 stories_2021-12-24.csv
collect_date,stories_id,media_id,url
2021-12-24 23:49:57.083076,2147254999,655701,https://ulan.mk.ru/video/2021/12/25/pervaya-godovshhina-podnyatiya-andreevskogo-flaga-na-korvete-geroy-rossiyskoy-federacii-aldar-cydenzhapov.html
2021-12-24 23:49:57.014398,2147254998,655701,https://kavkaz.mk.ru/social/2021/12/25/student-stavropolskogo-filiala-rankhigs-prinyal-uchastie-v-forume-studaktiva.html
```
Looking at the other end of mediacloud-database-files/2021, at 2021-01-01:
There are three files, all of which have downloads_id:
```
phil@p27:~$ head -1 stories_2021-01-01*
==> stories_2021-01-01.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url
==> stories_2021-01-01_v2.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url
==> stories_2021-01-01_v3.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url
```
Across the three files, only 86% are unique:
```
phil@p27:~$ wc -l stories_2021-01-01*
  100001 stories_2021-01-01.csv
  328389 stories_2021-01-01_v2.csv
  257678 stories_2021-01-01_v3.csv
  686068 total
phil@p27:~$ sort -u stories_2021-01-01* | wc -l
588464
```
Looks like the overlap is between the files, not within them:
```
phil@p27:~$ for x in stories_2021-01-01*; do
> echo $x; sort -u $x | wc -l
> done
stories_2021-01-01.csv
100001
stories_2021-01-01_v2.csv
328389
stories_2021-01-01_v3.csv
257678
```
Looks like downloads_id is present only up through mid-November (it is missing from the later files):
```
phil@p27:~$ head -1 stories_2021-11-*
==> stories_2021-11-11.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url
==> stories_2021-11-23.csv <==
collect_date,stories_id,media_id,url
```
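A quick way to confirm exactly where the downloads_id column disappears would be to scan just the header line of each daily CSV (a sketch using boto3; it reads only the first kilobyte of each object):

```python
# Sketch: report which daily CSVs lack a downloads_id column by reading
# only the first kilobyte (enough for the header line) of each object.
import boto3

def csvs_missing_downloads_id(bucket: str, prefix: str):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            head = s3.get_object(Bucket=bucket, Key=obj["Key"], Range="bytes=0-1023")
            header = head["Body"].read().decode("utf-8", "replace").splitlines()[0]
            if "downloads_id" not in header.split(","):
                yield obj["Key"]

# for key in csvs_missing_downloads_id("mediacloud-database-files", "2021/stories_2021-1"):
#     print(key)
```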
Generated at different times (from different databases?):
```
phil@p27:~$ aws s3 ls s3://mediacloud-database-files/2021/ | grep 2021.11
2022-11-23 08:23:01  171607723 stories_2021-11-01.csv
2022-11-23 08:23:04  190802596 stories_2021-11-02.csv
2022-11-23 08:23:04  192421552 stories_2021-11-03.csv
2022-11-23 08:23:06  187801571 stories_2021-11-04.csv
2022-11-23 08:23:07  177020925 stories_2021-11-05.csv
2022-11-23 08:23:18  129040813 stories_2021-11-06.csv
2022-11-23 08:23:22  138231824 stories_2021-11-07.csv
2022-11-23 08:23:22  190744853 stories_2021-11-08.csv
2022-11-23 08:23:23  196410284 stories_2021-11-09.csv
2022-11-23 08:23:27  199264106 stories_2021-11-10.csv
2022-11-23 08:23:32  189706034 stories_2021-11-11.csv
2023-02-17 00:53:13  190362479 stories_2021-11-23.csv
2023-02-17 00:53:13  257671902 stories_2021-11-24.csv
2023-02-17 00:53:13  160786079 stories_2021-11-25.csv
2023-02-17 00:53:13  151860953 stories_2021-11-26.csv
2023-02-17 00:53:13  109919256 stories_2021-11-27.csv
2023-02-17 00:53:16  112239747 stories_2021-11-28.csv
2023-02-17 00:53:16  163218600 stories_2021-11-29.csv
2023-02-17 00:53:16  181842769 stories_2021-11-30.csv
```
What does that downloads_id represent? Is it just an index value from the old system? I assume some change to the indexer will be necessary in order to re-index without that value, but am I right to say that it's not really necessary in the new index?
@pgulley downloads_id is the key (object/file name) for the saved HTML in the mediacloud-downloads-backup S3 bucket (it's necessary to retrieve the HTML, but otherwise of no use to the new system).
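In other words, given a downloads_id (and, for the overlap periods, a version), retrieving the HTML is conceptually just an S3 GET. A sketch follows; whether the key is exactly the decimal downloads_id and whether the stored body is gzip-compressed are assumptions here, not confirmed details:

```python
# Sketch: fetch saved HTML for a downloads_id from the backup bucket.
# Key format and compression are ASSUMPTIONS, not confirmed details.
import gzip
from typing import Optional

import boto3

def fetch_html(downloads_id: int, version_id: Optional[str] = None) -> bytes:
    s3 = boto3.client("s3")
    kwargs = {"Bucket": "mediacloud-downloads-backup", "Key": str(downloads_id)}
    if version_id:
        kwargs["VersionId"] = version_id
    body = s3.get_object(**kwargs)["Body"].read()
    # stored objects may be gzip-compressed (assumption)
    return gzip.decompress(body) if body[:2] == b"\x1f\x8b" else body
```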
oh interesting- does that mean we don't have html saved for December 2021 then?
> oh interesting- does that mean we don't have html saved for December 2021 then?
The HTML is on S3, we just don't have the keys to retrieve it ready to use. The question is whether we still have backups of (one of) the PG database(s) that can link the downloads_id, URL and date range.
Going back to my goal of testing my recent hist-fetcher fixes, I've launched a staging stack on bernstein for just October CSV files:
```
./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-files/2021/stories_2021-10
```
And now running a staging stack (50K stories) for a date inside DB-B dl_id overlap range:
```
./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-b-files/stories_2021_10_01.csv
```
January 2021 has finished. Just merged main to staging, and launched a staging stack on bernstein:
```
root@bernstein:/nfs/ang/users/pbudne/story-indexer# ./docker/deploy.sh -T historical -Y 2021 -H /stories_2021-0
upstream staging branch up to date.
creating docker-compose.yml
cloning story-indexer-config repo
QUEUER_ARGS --force --sample-size 50000 s3://mediacloud-database-files/2021/stories_2021-0
```
Logs show it starting from mid-September:
```
2024-08-17 21:35:58,500 1d54803aa1ef hist-queuer INFO: process_file s3://mediacloud-database-files/2021/stories_2021-09-14.csv
```
Created https://github.com/mediacloud/story-indexer/issues/328 with some observations about hist ingest.
Looking at the remaining 2021 data hole (date range 2021-11-19 thru 2021-12-26): the HTML (and possibly RSS) files are on S3:
downloads_id | isodate | unix_ts | S3 Version Id |
---|---|---|---|
3306845683 | 2021-11-18T23:59:57+00:00 | 1637279997.0 | |
3306845686 | 2021-11-18T23:59:59+00:00 | 1637279999.0 | |
3306845687 | 2021-11-19T00:00:00+00:00 | 1637280000.0 | |
3306845689 | 2021-11-19T00:00:02+00:00 | 1637280002.0 | |
3360714567 | 2021-12-26T10:32:56+00:00 | 1640514776.0 | |
3360714571 | 2021-12-26T10:32:56+00:00 | 1640514776.0 | |
3360714572 | 2021-12-26T10:32:55+00:00 | 1640514775.0 | 8wiLKAcAGi5E14BySGDraSCl6fSe00DW |
3360714573 | 2021-12-26T10:33:09+00:00 | 1640514789.0 | DhkyUneyXRbliInFgDRV8yKyh6ggM1dz |
3360714572 | 2022-01-29T00:21:20+00:00 | 1643415680.0 | hWCsDiDn6QouyLe89IS6cU2oMrqIWzZ8 |
3360714573 | 2022-01-29T00:21:28+00:00 | 1643415688.0 | vrbK1TSlTaYa5raJrLWpi6k0J0xh5uT2 |
3360714574 | 2022-01-29T00:21:21+00:00 | 1643415681.0 | |
3360714577 | 2022-01-29T00:21:20+00:00 | 1643415680.0 |
With or without the RSS, it may be possible to recover some significant percentage of the HTML using "canonical link" tags...
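A sketch of the canonical-link idea, using only the standard-library HTML parser (BeautifulSoup or similar would also work):

```python
# Sketch: pull the <link rel="canonical" href="..."> URL out of saved HTML,
# which could help match stored HTML objects back to story URLs.
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            d = dict(attrs)
            if (d.get("rel") or "").lower() == "canonical":
                self.canonical = d.get("href")

def canonical_url(html_text: str):
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical

# canonical_url('<head><link rel="canonical" href="https://example.com/x"></head>')
# -> "https://example.com/x"
```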
Once 2022 re-indexing is done (#271), we should start on 2021, continuing to work backwards chronologically. For all these dates, I think we can ingest stories from previously-generated CSV files that refer to HTML files in the giant S3 bucket. Is this right?
This should include:
Ideally this would be done by July 1, but that depends on when 2022 finishes.