harvard-lil / perma

Indelible links

Add sampling task #3657

Closed bensteinberg closed 3 days ago

bensteinberg commented 1 week ago

This is an experiment in asserting the completeness of a backup of Perma WARC and WACZ files. We've done something like this before, by enumeration; one brute-force approach is to enumerate all objects in the S3 buckets and record their etags, then check whether any are missing from the mirror, or present but with a non-matching etag.
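For concreteness, that brute-force enumeration might look roughly like this -- a sketch assuming boto3, with the bucket name and output path made up:

```python
# Rough sketch of the enumeration approach: list every object in the
# bucket and record its key and etag for later comparison on the mirror.
# "perma-captures" and the output path are hypothetical.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

with open("/tmp/etags.tsv", "w") as out:
    for page in paginator.paginate(Bucket="perma-captures"):
        for obj in page.get("Contents", []):
            etag = obj["ETag"].strip('"')  # S3 wraps etags in double quotes
            out.write(f"{obj['Key']}\t{etag}\n")
```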

Another approach is to enumerate, from Perma's database, all the artifacts that should be present, get their etags from S3, and so on. Either way, enumeration becomes a bigger and bigger task as the collection grows, involves redoing work, and is, I think, somewhat fragile.

My idea is to assert completeness statistically. My first basic notion was something like "The null hypothesis is that at least 1 in x files do not make it to the mirror. To disprove the null hypothesis, you'd sample n of the captured files and, not seeing a failure, show that the chance of not seeing a failure in the sample, given the null hypothesis, falls below the typical threshold, a little under two and a half percent."

On rereading part of my intro stats book, I think a better way of putting this is in terms of a sampling distribution model for a proportion. The proportion here is the proportion of artifacts that are not present on the mirror; the sampling distribution is the distribution of that proportion across all possible samples (something you could approximate by simulation). The basic idea is that it is approximately normally distributed and so can be characterized like this:

"Provided that the sampled values are independent and the sample size (n) is large enough, the sampling distribution model of p̂ (observed proportion in the sample) is modeled by a Normal model with mean(p̂) = p (model parameter aka hypothesized proportion) and standard deviation of p̂ = the square root of (p * (1 - p)) / n."

The provisos (independent, large enough) translate to the randomization condition (the sample is chosen randomly), the 10% condition (the sample size is less than 10% of the population), and the success/failure condition (the sample size has to be big enough to expect at least ten successes and ten failures). That is, we can't test for a failure rate of one in five million (roughly the population size): expecting ten failures at that rate would take a sample of fifty million, larger than the population itself, let alone under 10% of it.

A tenth of five million is 500,000 -- that seems like way too many to test. Let's say n is 1,000, well under 10% of the population. Then let's pick a hypothesized proportion that satisfies the success/failure condition: a 99% success rate and 1% failure rate would produce, in expectation, ten failures and 990 successes in a sample of 1,000. The standard deviation is the square root of ((0.99 * 0.01) / 1000), which is 0.00314642654451, or about a third of a percentage point (our unit of measure).
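As a quick sanity check of the arithmetic (plain Python, nothing Perma-specific):

```python
from math import sqrt

p = 0.01  # hypothesized failure proportion
n = 1000  # sample size

# success/failure condition: expect at least ten failures and ten successes
assert n * p >= 10 and n * (1 - p) >= 10

sd = sqrt((p * (1 - p)) / n)
print(sd)  # 0.0031464265445104... -- about a third of a percentage point
```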

I expect that in a random sample of 1,000 links, the proportion of failures will be at least two standard deviations below the nominal 1% rate we're assuming here. Two standard deviations below 1% is about 0.37%, or fewer than four failures in 1,000 -- an outcome that would occur less than 2.5% of the time if the true failure proportion really were 1%. In fact, I'm expecting our random sample to have one or zero failures, which would have well under a one percent chance of occurring if the population actually had a 1% failure rate.
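And the tail probability that a second cut would compute, sketched with NormalDist (one observed failure in 1,000 shown for illustration):

```python
from statistics import NormalDist

p, n = 0.01, 1000
sd = (p * (1 - p) / n) ** 0.5

failures = 1          # illustrative outcome
p_hat = failures / n  # observed proportion, 0.001
z = (p_hat - p) / sd  # about -2.86

# chance of a sample proportion this low or lower, if the true rate is 1%
print(NormalDist().cdf(z))  # about 0.002, i.e. roughly 0.2%
```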

This PR adds an invoke task to generate a sample of Links, outputting object paths and etags. It generates a Python script to be run on the mirror, with the sample embedded. That script compares the items in the sample with what is on disk, and reports successes, failures, standard deviation, and z-score; I think a second cut at this will add the probability of seeing the sample's proportion, probably using NormalDist from statistics, but I want to sanity-check this against a probability table first. :)
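For a sense of shape, the mirror-side check might look something like the sketch below. Everything here is illustrative: the sample format, the mount point, and the assumption that etags are plain MD5s (true only for single-part uploads; multipart etags are composites):

```python
# Illustrative mirror-side check: compare each sampled object's etag
# against the MD5 of the corresponding file on disk, then report the
# sample statistics. Paths and the embedded sample are placeholders.
import hashlib
from pathlib import Path
from statistics import NormalDist

SAMPLE = [
    # ("relative/object/path.warc.gz", "expected-etag"), ...
]
MIRROR_ROOT = Path("/data/mirror")  # hypothetical mount point
P = 0.01                            # hypothesized failure proportion


def file_md5(path: Path) -> str:
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            md5.update(chunk)
    return md5.hexdigest()


failures = 0
for rel_path, etag in SAMPLE:
    target = MIRROR_ROOT / rel_path
    if not target.exists() or file_md5(target) != etag:
        failures += 1

n = len(SAMPLE)
sd = (P * (1 - P) / n) ** 0.5
z = (failures / n - P) / sd
tail = NormalDist().cdf(z)
print(f"{failures}/{n} failures, sd={sd:.5f}, z={z:.2f}, P={tail:.4f}")
```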

I am not sure about the pattern of generating a Python script as output; if that seems weird, I could generate the sample as JSON and keep the script somewhere else, maybe in services/ in this repo. I'm not even that sure about generating the data and putting it in /tmp, but we do that in other tasks.

It is quite possible this is wrong. If any of this is confusing, please let me know!

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 0% with 59 lines in your changes missing coverage. Please review.

Project coverage is 69.08%. Comparing base (2f4f552) to head (6ccf8aa). Report is 7 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| perma_web/tasks/dev.py | 0.00% | 59 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop    #3657      +/-   ##
===========================================
- Coverage    69.64%   69.08%   -0.56%
===========================================
  Files           54       54
  Lines         7339     7398      +59
===========================================
  Hits          5111     5111
- Misses        2228     2287      +59
```

:umbrella: View full report in Codecov by Sentry.



bensteinberg commented 4 days ago

Separately, I've been considering removing the HistoricalLink model, as it is enormous, never gets used, and is unwieldy for backups and database upgrades -- but it strikes me that if the question is not "What links does Perma think have WARC or WACZ files" but "What links does Perma think have ever had WARC or WACZ files", the answer might be in HistoricalLink.

bensteinberg commented 3 days ago

After discussion on Slack, I'm hoping to get links and objects in the same way, but additionally to get replaced objects by querying links.filter(capture_job__superseded=True) and finding the old object names. The old object is named something like W4LT-DTQ4_replaced_1721842551.278022.warc.gz, but I have not been able to derive that timestamp from any time recorded in the Link object. I also think it's possible that there could be more than one replaced file?
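If the timestamp really can't be derived, one possible workaround, sketched below with hypothetical names (only capture_job__superseded comes from the discussion above), would be to list by prefix, which would also pick up multiple replaced files if they exist:

```python
# Sketch: find replaced objects by listing with the "<GUID>_replaced_"
# prefix instead of reconstructing the timestamp. The bucket name, key
# layout, and import path are assumptions.
import boto3

from perma.models import Link  # import path may differ

s3 = boto3.client("s3")

for link in Link.objects.filter(capture_job__superseded=True):
    prefix = f"{link.guid}_replaced_"
    resp = s3.list_objects_v2(Bucket="perma-captures", Prefix=prefix)
    for obj in resp.get("Contents", []):
        # handles the case of more than one replaced file per link
        print(obj["Key"], obj["ETag"].strip('"'))
```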

bensteinberg commented 3 days ago

Ah yes, @rebeccacremona has already pointed out:

> I think the thing our database maybe cannot tell us is just how many replaced warcs there might be for any given link