mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

re-index 2021 data #300

Open rahulbot opened 5 months ago

rahulbot commented 5 months ago

Once 2022 re-indexing is done (#271) we should start on 2021, continuing to work backwards chronologically. For all these dates I think we can ingest stories from previously-generated CSV files that refer to HTML files in the giant S3 bucket. Is this right?

This should include:

Ideally this would be done by July 1, but that depends on when 2022 finishes.

philbudne commented 5 months ago

It's my recall that there are no holes in the 2021 record, HOWEVER:

  1. There's the "overlap" period, where two instances of the system used the same range(s) of database ids.

    There is code in hist-fetcher.py to handle this (and copious comments), but I HAVE NOT TESTED IT. My recollection is that I look at the date in the CSV file, determine which database epoch it corresponds to (B or D), then look at all versions of the S3 object for that downloads_id and pick (the?) one that was written in the same epoch/time period, WITHOUT checking that the dates are close/sane (see the sketch at the end of this comment).

    In other words, it requires some examination before letting it rip.

  2. There are 589 daily CSV files under s3://mediacloud-database-files/2021/

    It looks like there are (up to?) three versions of each day for dates between 2021-01-01 and 2021-05-15.

    Except for dates where all three files are trivial (59 bytes, presumably just a column-header line), the three files seem to have different sizes. Here is a snippet of "aws s3 ls" output:

    2022-12-28 11:43:52   16234828 stories_2021-04-05.csv
    2022-12-28 11:43:53   80451582 stories_2021-04-05_v2.csv
    2022-12-28 11:43:53   72063717 stories_2021-04-05_v3.csv
    2022-12-28 11:43:54   16485868 stories_2021-04-06.csv
    2022-12-28 11:43:56   96556995 stories_2021-04-06_v2.csv
    2022-12-28 11:43:57   78333686 stories_2021-04-06_v3.csv
    2022-12-28 11:43:57   16470124 stories_2021-04-07.csv
    2022-12-28 11:43:59   97247724 stories_2021-04-07_v2.csv
    2022-12-28 11:43:59   77504728 stories_2021-04-07_v3.csv
    2022-12-28 11:43:59   16496389 stories_2021-04-08.csv
    2022-12-28 11:44:02   79591431 stories_2021-04-08_v2.csv
    2022-12-28 11:44:04         59 stories_2021-04-08_v3.csv
    2022-11-23 01:14:05         59 stories_2021-04-09.csv
    2022-12-28 11:44:05         59 stories_2021-04-09_v2.csv
    2022-12-28 11:44:05         59 stories_2021-04-09_v3.csv
    2022-11-23 01:14:05         59 stories_2021-04-10.csv
    2022-12-28 11:44:05         59 stories_2021-04-10_v2.csv
    2022-12-28 11:44:05         59 stories_2021-04-10_v3.csv
    2022-11-23 01:14:06         59 stories_2021-04-11.csv
    2022-12-28 11:44:05         59 stories_2021-04-11_v2.csv
    2022-12-28 11:44:06         59 stories_2021-04-11_v3.csv
    2022-11-23 01:14:06         59 stories_2021-04-12.csv
    2022-12-28 11:44:06         59 stories_2021-04-12_v2.csv
    2022-12-28 11:44:06         59 stories_2021-04-12_v3.csv
    2022-12-28 11:44:06   16910341 stories_2021-04-13.csv
    2022-12-28 11:44:06         59 stories_2021-04-13_v2.csv
    2022-12-28 11:44:06  154798430 stories_2021-04-13_v3.csv
    2022-12-28 11:44:06   17830103 stories_2021-04-14.csv
    2022-12-28 11:44:06  117418301 stories_2021-04-14_v2.csv
    2022-12-28 11:44:08   76491043 stories_2021-04-14_v3.csv

    The "v2" file seems to be the largest in MOST cases, but see 4-13 above as an exception.

    If we process more than one file for each date, it seems possible/likely that we would download each HTML file as many as three times.

    hist-queuer.py avoids re-downloading S3 objects for the same remote URL (the old system downloaded a story each time it appeared in a different feed), but it cannot deduplicate across input CSV files.

    Does anyone remember how the different versions came about?
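
A minimal sketch of the version-picking idea from item 1, for discussion only: this is NOT the actual hist-fetcher.py code (which remains untested), and the bucket name and epoch windows are assumptions pieced together from this thread.

```python
import boto3

s3 = boto3.client("s3")

# Assumed wall-clock windows (end-exclusive) during which each database
# instance was writing objects; dates from the D/B overlap periods
# discussed elsewhere in this thread.
EPOCH_WINDOWS = {
    "B": ("2021-09-15", "2021-11-12"),
    "D": ("2021-12-26", "2022-01-26"),
}

def pick_version(downloads_id: int, epoch: str) -> str | None:
    """Return the VersionId of the object version written during `epoch`,
    WITHOUT checking that the dates are close/sane."""
    key = str(downloads_id)
    start, end = EPOCH_WINDOWS[epoch]
    resp = s3.list_object_versions(
        Bucket="mediacloud-downloads-backup", Prefix=key
    )
    for v in resp.get("Versions", []):
        if v["Key"] != key:
            continue  # Prefix matching can return longer keys
        written = v["LastModified"].strftime("%Y-%m-%d")
        if start <= written < end:
            return v["VersionId"]
    return None
```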

thepsalmist commented 5 months ago

> Does anyone remember how the different versions came about?

The versions came from batching the CSV exports (e.g. 00:00–12:00 and 12:00–23:59) to avoid Postgres query timeouts. A script to combine the CSVs into a single version should suffice.

philbudne commented 5 months ago

> A script to combine the CSVs into a single version should suffice.

Not strictly necessary: the queuer doesn't check file suffixes (.csv).

The only advantage would be eliminating duplicates (the legacy system downloaded a story multiple times if it appeared in multiple RSS feeds).
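
For the record, a minimal sketch of such a dedup merge for one day's files (file names are assumptions, the URL is used as the dedup key, and all versions are assumed to share the same header):

```python
import csv
import glob

# Merge one day's CSV versions, keeping the first row seen per URL
# (the legacy system stored a copy per feed the story appeared in).
seen: set[str] = set()
with open("combined_2021-01-01.csv", "w", newline="") as out:
    writer = None
    for path in sorted(glob.glob("stories_2021-01-01*.csv")):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writer.writeheader()
            for row in reader:
                if row["url"] not in seen:
                    seen.add(row["url"])
                    writer.writerow(row)
```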

philbudne commented 3 months ago

The hist- stack processing the "Database D" csv files (for 2022 and 2021) has completed processing of 2021/12/31 back to 2021/12/27. BUT hist-fetcher was unhappy with 2021/12/26 (looks like it tossed everything from that day into quarantine).

philbudne commented 3 months ago

The queuer processes files in reverse lexicographic (character set) order. This is my analysis of the chunks, in processing order (top to bottom this time), needed to cover the year:

| status | bucket | prefix | object name format | start | end | notes |
| --- | --- | --- | --- | --- | --- | --- |
| done | mediacloud-database-d-files | | stories_2021_mm_dd | 2021/12/26 | 2021/12/31 | *️ |
| see below | mediacloud-database-files | /2021/stories_2021-1 | stories_2021-mm-dd | 2021/10/14 | 2021/12/25 | † *️ |
| see below | mediacloud-database-c-files | | 2021_mm_dd.csv | 2021/11/12 | 2021/11/21 | |
| see below | mediacloud-database-b-files | /stories_2021- | stories_2021-mm-dd | 2021/10/14 | 2021/11/11 | |
| see below | mediacloud-database-b-files | /stories2021 | stories_2021_mm_dd | 2021/09/15 | 2021/10/13 | *️ |
| done | mediacloud-database-files | /2021/stories_2021-0 | stories_2021-mm-dd | 2021/01/01 | 2021/09/14 | |

*️NOTE: DB D/B overlap periods are 2021-09-15 thru 2021-11-11 (DB B) and 2021-12-26 thru 2022-01-25 (DB D)

† empty files for 1/31, 4/8 thru 4/12, missing 9/15 thru 10/13 (??)

see https://github.com/mediacloud/story-indexer/issues/329 for other ranges that need downloads_ids

December 25 to October 14

| status | start | end | notes |
| --- | --- | --- | --- |
| X | 2021-11-12 | 2021-12-25 | need downloads_ids (from DB F) |
| X | 2021-11-21 | 2021-11-12 | needs downloads_ids (from DB C) |
| running | 2021-11-11 | 2021-09-14 | available in mc-db-b-files AND mc-db-files? *️ |

philbudne commented 3 months ago

I tested 10/1/2021 (epoch B) in my dev stack, pulled main from upstream, merged main to staging, and launched a staging stack on bernstein:

./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-files/2021/stories_2021-1

I also removed the production hist-indexer stack.

These are my normal steps when deploying ANY stack. After a few minutes:

  1. check for containers that were recently launched and have recently exited with non-zero status
  2. tail the messages.log file to look for "normal" operation (fetching, parsing, importing)
  3. watch grafana for 10 minutes, looking for smooth, continuous operation and no bouncing in the container counts

In this case, grafana showed fetcher activity but no parser activity: the hist-fetcher had reported all stories as "bad-dlid".

philbudne commented 3 months ago

The errors look like:

2024-08-09 19:55:08,237 82f8da64c4d1 hist-fetcher INFO: bad-dlid: EMPTY
2024-08-09 19:55:08,238 82f8da64c4d1 hist-fetcher INFO: quarantine: QuarantineException('bad-dlid')

I downloaded the csv file:

aws s3 cp s3://mediacloud-database-files/2021/stories_2021-12-25.csv .

and I don't see a downloads_id:

collect_date,stories_id,media_id,url
2021-12-25 08:18:56.292171,2147483646,272136,https://observador.pt/2021/12/25/covid-19-coordenador-cientifico-italiano-considera-reforco-da-vacina-crucial-contra-omicron/
2021-12-25 08:18:56.291037,2147483645,375830,https://www.sudouest.fr/gironde/gujan-mestras/bassin-d-arcachon-une-cabane-en-feu-a-gujan-mestras-7452333.php

Same for the 24th:

pbudne@ifill:~$ head -3 stories_2021-12-24.csv 
collect_date,stories_id,media_id,url
2021-12-24 23:49:57.083076,2147254999,655701,https://ulan.mk.ru/video/2021/12/25/pervaya-godovshhina-podnyatiya-andreevskogo-flaga-na-korvete-geroy-rossiyskoy-federacii-aldar-cydenzhapov.html
2021-12-24 23:49:57.014398,2147254998,655701,https://kavkaz.mk.ru/social/2021/12/25/student-stavropolskogo-filiala-rankhigs-prinyal-uchastie-v-forume-studaktiva.html
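
A quick way to survey which daily files are usable at all is to check each header for a downloads_id column. A minimal sketch (assumes the CSVs have been copied locally, as above):

```python
import csv
import glob

# Flag CSV files whose header lacks the downloads_id column; without it
# the rows can't be mapped back to saved HTML objects on S3.
for path in sorted(glob.glob("stories_2021-*.csv")):
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    status = "ok" if "downloads_id" in header else "NO downloads_id"
    print(f"{path}: {status}")
```
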
philbudne commented 3 months ago

Looking at the other end of mediacloud-database-files/2021, at 2021-01-01:

There are three files, all have downloads_id:

phil@p27:~$ head -1 stories_2021-01-01*
==> stories_2021-01-01.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

==> stories_2021-01-01_v2.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

==> stories_2021-01-01_v3.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

Across the three files, only ~86% of the lines are unique:

phil@p27:~$ wc -l stories_2021-01-01*
   100001 stories_2021-01-01.csv
   328389 stories_2021-01-01_v2.csv
   257678 stories_2021-01-01_v3.csv
   686068 total
phil@p27:~$ sort -u stories_2021-01-01* | wc -l
588464

Looks like the overlap is between, not within the files:

phil@p27:~$ for x in stories_2021-01-01*; do
> echo $x; sort -u $x | wc -l
> done
stories_2021-01-01.csv
100001
stories_2021-01-01_v2.csv
328389
stories_2021-01-01_v3.csv
257678

philbudne commented 3 months ago

Working backwards, downloads_id starts being present in mid-November (it's in stories_2021-11-11.csv but not in stories_2021-11-23.csv):

phil@p27:~$ head -1 stories_2021-11-*
==> stories_2021-11-11.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

==> stories_2021-11-23.csv <==
collect_date,stories_id,media_id,url

Generated at different times (from different databases?):

phil@p27:~$ aws s3 ls s3://mediacloud-database-files/2021/| grep 2021.11
2022-11-23 08:23:01  171607723 stories_2021-11-01.csv
2022-11-23 08:23:04  190802596 stories_2021-11-02.csv
2022-11-23 08:23:04  192421552 stories_2021-11-03.csv
2022-11-23 08:23:06  187801571 stories_2021-11-04.csv
2022-11-23 08:23:07  177020925 stories_2021-11-05.csv
2022-11-23 08:23:18  129040813 stories_2021-11-06.csv
2022-11-23 08:23:22  138231824 stories_2021-11-07.csv
2022-11-23 08:23:22  190744853 stories_2021-11-08.csv
2022-11-23 08:23:23  196410284 stories_2021-11-09.csv
2022-11-23 08:23:27  199264106 stories_2021-11-10.csv
2022-11-23 08:23:32  189706034 stories_2021-11-11.csv
2023-02-17 00:53:13  190362479 stories_2021-11-23.csv
2023-02-17 00:53:13  257671902 stories_2021-11-24.csv
2023-02-17 00:53:13  160786079 stories_2021-11-25.csv
2023-02-17 00:53:13  151860953 stories_2021-11-26.csv
2023-02-17 00:53:13  109919256 stories_2021-11-27.csv
2023-02-17 00:53:16  112239747 stories_2021-11-28.csv
2023-02-17 00:53:16  163218600 stories_2021-11-29.csv
2023-02-17 00:53:16  181842769 stories_2021-11-30.csv

pgulley commented 3 months ago

What does that downloads_id represent? Is it just an index value from the old system? I assume some change to the indexer will be necessary in order to re-index without that value, but am I right to say that it's not really needed in the new index?

philbudne commented 3 months ago

@pgulley downloads_id is the key (object/file name) for the saved HTML in the mediacloud-downloads-backup S3 bucket (necessary to retrieve the HTML, otherwise not of use to the new system)

pgulley commented 3 months ago

Oh interesting: does that mean we don't have HTML saved for December 2021 then?

philbudne commented 3 months ago

> Oh interesting: does that mean we don't have HTML saved for December 2021 then?

The HTML is on S3; we just don't have the keys to retrieve it ready to use. The question is whether we still have backups of (one of) the PG database(s) that can link downloads_id, URL, and date range.
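
If such a backup could be restored, the lookup would be something like the sketch below. The downloads table and its column names are assumptions about the legacy schema, and the connection string is a placeholder; psycopg is used purely for illustration.

```python
import psycopg  # pip install "psycopg[binary]"

# Hypothetical query against a restored legacy database: recover the
# downloads_id -> URL mapping for the 2021-11-19 .. 2021-12-26 hole.
QUERY = """
    SELECT downloads_id, url, download_time
    FROM downloads
    WHERE download_time >= %s AND download_time < %s
    ORDER BY downloads_id
"""

with psycopg.connect("dbname=mediacloud_restore") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY, ("2021-11-19", "2021-12-27"))
        for downloads_id, url, download_time in cur:
            print(downloads_id, download_time.date(), url)
```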

philbudne commented 3 months ago

Going back to my goal of testing my recent hist-fetcher fixes, I've launched a staging stack on bernstein for just October CSV files:

./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-files/2021/stories_2021-10

philbudne commented 3 months ago

And now running a staging stack (50K stories) for a date inside DB-B dl_id overlap range:

./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-b-files/stories_2021_10_01.csv

philbudne commented 3 months ago

January 2021 has finished. Just merged main to staging, and launched a staging stack on bernstein:

root@bernstein:/nfs/ang/users/pbudne/story-indexer# ./docker/deploy.sh -T historical -Y 2021 -H /stories_2021-0
upstream staging branch up to date.
creating docker-compose.yml
cloning story-indexer-config repo
QUEUER_ARGS --force --sample-size 50000 s3://mediacloud-database-files/2021/stories_2021-0

Logs show it starting from mid-September, as reverse lexicographic ordering predicts:

2024-08-17 21:35:58,500 1d54803aa1ef hist-queuer INFO: process_file s3://mediacloud-database-files/2021/stories_2021-09-14.csv
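
That is the expected starting point: with reverse lexicographic ordering, the September files sort ahead of everything else under the stories_2021-0 prefix. A two-line illustration with hypothetical keys:

```python
keys = ["stories_2021-01-01.csv", "stories_2021-05-20.csv", "stories_2021-09-14.csv"]
print(sorted(keys, reverse=True)[0])  # -> stories_2021-09-14.csv
```
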
philbudne commented 3 months ago

Created https://github.com/mediacloud/story-indexer/issues/328 on some observations about hist ingest.

philbudne commented 2 months ago

Looking at the remaining 2021 data hole (image omitted):

(date range 2021-11-19 thru 2021-12-26). The HTML (and possibly RSS) files are on S3:

| downloads_id | isodate | unix_ts | S3 Version Id |
| --- | --- | --- | --- |
| 3306845683 | 2021-11-18T23:59:57+00:00 | 1637279997.0 | |
| 3306845686 | 2021-11-18T23:59:59+00:00 | 1637279999.0 | |
| 3306845687 | 2021-11-19T00:00:00+00:00 | 1637280000.0 | |
| 3306845689 | 2021-11-19T00:00:02+00:00 | 1637280002.0 | |
| 3360714567 | 2021-12-26T10:32:56+00:00 | 1640514776.0 | |
| 3360714571 | 2021-12-26T10:32:56+00:00 | 1640514776.0 | |
| 3360714572 | 2021-12-26T10:32:55+00:00 | 1640514775.0 | 8wiLKAcAGi5E14BySGDraSCl6fSe00DW |
| 3360714573 | 2021-12-26T10:33:09+00:00 | 1640514789.0 | DhkyUneyXRbliInFgDRV8yKyh6ggM1dz |
| 3360714572 | 2022-01-29T00:21:20+00:00 | 1643415680.0 | hWCsDiDn6QouyLe89IS6cU2oMrqIWzZ8 |
| 3360714573 | 2022-01-29T00:21:28+00:00 | 1643415688.0 | vrbK1TSlTaYa5raJrLWpi6k0J0xh5uT2 |
| 3360714574 | 2022-01-29T00:21:21+00:00 | 1643415681.0 | |
| 3360714577 | 2022-01-29T00:21:20+00:00 | 1643415680.0 | |

With or without the RSS, it may be possible to recover some significant percentage of the HTML using "canonical link" tags...
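
A minimal sketch of that recovery idea, using only the standard library (real pages would need charset handling and more defensive parsing):

```python
from html.parser import HTMLParser

# Pull <link rel="canonical" href="..."> out of saved HTML to recover
# a story's original URL when we can't map it from a CSV row.
class CanonicalFinder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.canonical: str | None = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # rel can be a space-separated list of link types
        if tag == "link" and "canonical" in (a.get("rel") or "").split():
            self.canonical = a.get("href")

def canonical_url(html: str) -> str | None:
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```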