Given the successful recovery using "blind fetching" of S3 objects and the extracted canonical URLs, I suggested we might want to do the same thing for the period in 2022 (approx 2022-01-25 thru 2022-05-05?) where Xavier fetched all the S3 objects but looked only at the RSS files to extract URLs, which we then tried to fetch again (and found significant "link rot").
It looks like the researchers would prefer filling out 2022 to working on prior years (2019 and earlier).
To see whether it might be a savings (assuming my memory is correct that access to S3 is free from EC2 instances) to scan the S3 objects from an EC2 instance and pack them into WARC files without ANY further processing (not even checking whether the file has a canonical URL), I threw together a packer.py script with bits cribbed from hist-fetcher and parser (for RSS detection).
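For the record, the core of the approach looks roughly like this (a minimal sketch, NOT the actual packer.py; the bucket name, key format, 36-byte cutoff, and the naive RSS check are all placeholders/assumptions):

    # Minimal sketch of the packer loop (not the actual packer.py);
    # bucket name, key format, size cutoff, and the RSS check are placeholders.
    from io import BytesIO

    import boto3
    from botocore.exceptions import ClientError
    from warcio.warcwriter import WARCWriter

    BUCKET = "downloads-backup"          # placeholder bucket name

    def pack_range(first_id: int, last_id: int, warc_path: str) -> None:
        s3 = boto3.client("s3")
        with open(warc_path, "wb") as f:
            writer = WARCWriter(f, gzip=True)
            for obj_id in range(first_id, last_id + 1):
                key = str(obj_id)        # placeholder key format
                try:
                    resp = s3.get_object(Bucket=BUCKET, Key=key)
                except ClientError:
                    continue             # non-existent object ID
                body = resp["Body"].read()
                if len(body) <= 36:
                    continue             # almost certainly a "duplicate feed download" stub
                if body.lstrip().startswith((b"<?xml", b"<rss")):
                    continue             # RSS file, not an HTML story
                record = writer.create_warc_record(
                    f"s3://{BUCKET}/{key}",
                    "resource",
                    payload=BytesIO(body),
                    warc_content_type="text/html",
                )
                writer.write_record(record)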
To get 100 HTML files it scanned 767 S3 objects (some non-existent, some 36 bytes or smaller, which almost certainly contain a message saying it was a duplicate feed download), downloading a total of 5676225 bytes (avg 7400 bytes/obj scanned) and writing a WARC file of 2726394 bytes (48% of the size), so it might be worthwhile (with more consideration and math).
Then I created a t4g.nano instance ($0.0042/hr) to see how much faster downloads are from inside AWS, and it took about half the time (23 seconds vs 48 seconds from ifill). That doesn't include additional time for the EC2 instance to copy the WARC file to S3.
Further data points:
My initial estimate (working only from the previous run of a one-month period, and factoring in halving the speed to avoid hogging UMass bandwidth) was that it could take 56 days.
Poking around in the S3 bucket, it looks like the object ID range covers about 113 million objects to be scanned; at 50 downloads/second (the current historical ingest rate with 6 fetchers), that looks like it could be only 26 days.
So six "packers" (each given a share of the object ID range) running in EC2 at 33 obj/second is 200 obj/second, and 113Mobj divided by 200 obj/sec looks to be about a week of EC2 time.
A t3a.xlarge instance (4 AMD CPUs) is $0.15/hr, which would be $25 for a week (not counting EBS costs for the root disk).
Amazon pricing usually doubles for a doubling in resources, so the total price might be the same regardless of instance size; the instance size just determines the speed (assuming there isn't some other bottleneck).
With the 7400 bytes/obj number from above, at 113M objects, that's 836GB of download to transfer the raw objects; the WARC file came in at 3555 bytes/object, or 402GB to download.
Processing the packed WARC files should be much like any other historical ingest (although it will require a different stack flavor), and I'd expect we would be able to process at the same rate (a month every 4 days at 50 stories/second), so 12 days. The arch-queuer shouldn't need any changes, and it can scan an S3 bucket for new additions, so the pipeline could run at the same time as the EC2 processing.
It looks like we transferred 2TB/mo out of AWS in Sept and October, which puts us in the $0.09/GB bracket, so a savings of 434GB would be $39; with at LEAST $25 of EC2 cost, that means a savings of at most $14.
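For the record, the arithmetic behind that estimate (the $0.09/GB egress rate and ~$25/week of EC2 are the assumptions):

    # Cost arithmetic behind the estimate above.
    objects = 113_000_000
    raw_gb  = objects * 7400 / 1e9     # ~836 GB if the raw objects leave AWS
    warc_gb = objects * 3555 / 1e9     # ~402 GB if packed into WARCs first

    egress_rate = 0.09                 # $/GB at the 2TB/mo tier
    savings = (raw_gb - warc_gb) * egress_rate   # ~$39 of egress avoided
    ec2_cost = 0.15 * 24 * 7           # ~$25 for a week of t3a.xlarge
    print(savings - ec2_cost)          # ~$14 net savings, at best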
Running ad-hoc packers (as opposed to a RabbitMQ-based pipeline/stack) has the disadvantage that if the packer processes quit, they wouldn't be able to pick up where they left off without some record keeping. To get the RSS filtering capability we'd need a worker that does just that, or a parser option that says to do ONLY that!
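The record keeping could be as simple as each packer periodically writing the last object ID it finished to a small state file; a sketch (not something packer.py does today):

    import os

    STATE_FILE = "packer-0.state"      # one state file per packer; name is illustrative

    def load_checkpoint(default_first_id: int) -> int:
        """Return the next object ID to fetch, resuming after a restart if possible."""
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return int(f.read().strip()) + 1
        return default_first_id

    def save_checkpoint(obj_id: int) -> None:
        """Record the last object ID fully written to the WARC (call periodically)."""
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(obj_id))
        os.replace(tmp, STATE_FILE)    # atomic rename so a crash can't corrupt the state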
One thing I haven't examined is how many duplicate stories we might end up with (where the canonical URL differs from the final URL we got when downloading using the RSS file URL); nor have I looked at whether we could delete the stories previously fetched using Xavier's CSV files. One way would be to look at the WARC files written when the CSVs were processed, but there might be other ways (looking at indexed_date and published_date?)
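If we went the indexed_date/published_date route, a first pass at counting candidates might look something like this (the host, index pattern, and the CSV-processing date window are all guesses I haven't checked against the real index):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # placeholder host

    # Count stories whose published_date falls in the 2022 gap but whose
    # indexed_date falls in the window when the CSVs were processed
    # (that window is a guess; I'd need to look it up).
    resp = es.search(
        index="mc_search-*",                      # placeholder index pattern
        query={
            "bool": {
                "filter": [
                    {"range": {"published_date": {"gte": "2022-01-25", "lte": "2022-05-05"}}},
                    {"range": {"indexed_date": {"gte": "2024-01-01", "lte": "2024-06-30"}}},
                ]
            }
        },
        size=0,
    )
    print(resp["hits"]["total"])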
I used an ARM64 instance, initially with IPv6 only, running Ubuntu 24.04, and had some "fun":