harvard-lil / perma-capture


Capture tech spec / tests / evaluation #103

Open matteocargnelutti opened 1 year ago

matteocargnelutti commented 1 year ago

Work in progress 🚧

Identified capture constraints

What does Perma currently do? Answer: https://hlslil.slack.com/archives/C0175QU6D98/p1661811773089149


Evaluating browsertrix-crawler

🔔 Sept 16 Update: @matteocargnelutti: Temporarily pausing this for a few days to explore alternatives.

Discussion: https://hlslil.slack.com/archives/C0175QU6D98/p1661541411348679

Setting limits:

Currently: Paused

  • [ ] How can we enforce a time limit on a given capture and still get an archive?
  • [ ] How can we enforce a size limit on a given capture and still get an archive?
  • [ ] What becomes of the resources that were being captured when the time or size limit is reached? (One possible approach to the time-limit side is sketched after this list.)
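
One external approach worth evaluating for the time-limit question: wrap the crawler invocation in a subprocess with a hard timeout. A minimal sketch, assuming the docker invocation from the browsertrix-crawler README; the collection name and the limit are illustrative, and whether an interrupted crawl leaves a usable archive behind is exactly the open question above.

```python
import os
import subprocess

TIME_LIMIT_SECONDS = 60  # illustrative per-capture budget


def capture_with_time_limit(url: str) -> bool:
    """Run a single browsertrix-crawler capture, aborted if it exceeds the limit."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}/crawls:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--collection", "capture-test",  # illustrative collection name
    ]
    try:
        subprocess.run(cmd, timeout=TIME_LIMIT_SECONDS, check=True)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the docker CLI process here; note that this
        # may not stop the container itself, and whether the partially written
        # archive under ./crawls is still usable is what we need to test.
        return False
```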

Benchmarking and misc:

Currently: Paused

  • [ ] Benchmarking 10 URLs representative of our use case on both Perma.cc and browsertrix
  • [ ] Can the Docker-specific overhead of browsertrix be measured and alleviated (i.e. batching)?
  • [ ] Restricting access:
    • [x] To protocols besides http(s)://
    • [ ] To entire IP ranges and URLs
    • [ ] Ensure these restrictions apply even in redirects (ideally: enforced at container level)

Signing

🔔 Sept 16 Update: @matteocargnelutti: Research into archive signing using authsign is happening on a different project. Moving this discussion over there for the time being; will report relevant findings here.


In parallel

🔔 Sept 16 Update: @matteocargnelutti: Wrapping up the existing codebase and listing remaining capture goals before transferring to Greg for evaluation. How far is this from a suitable option, and what does that tell us about the direction we should be taking?

Currently: @matteocargnelutti , @leppert

  • [x] Prototype a simplistic single-page archiving system implementing these constraints using in-browser network interception. The goal is mainly to help us better understand the process of network interception-based capture, its performance profile, and the mechanisms involved in dealing with both edge cases and the external constraints we want to apply to it. (A rough sketch of this kind of interception follows below.)
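
For reference, a minimal sketch of what single-page, network-interception-based capture looks like, here using Playwright's request routing (the prototype may well use different tooling); it only collects response bodies in memory, with none of the WARC-writing or limit-enforcement logic:

```python
from playwright.sync_api import sync_playwright


def capture_single_page(url: str) -> dict[str, bytes]:
    """Load one page and collect the body of every network response it triggers."""
    responses = []
    captured: dict[str, bytes] = {}

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Intercept every request: this is the hook where protocol, IP-range,
        # or size restrictions could be enforced before a request goes out.
        def handle_route(route):
            if route.request.url.startswith(("http://", "https://")):
                route.continue_()
            else:
                route.abort()

        page.route("**/*", handle_route)
        page.on("response", lambda response: responses.append(response))
        page.goto(url, wait_until="networkidle")

        # Pull the bodies once the page has settled.
        for response in responses:
            try:
                captured[response.url] = response.body()
            except Exception:
                pass  # some responses (e.g. redirects) have no retrievable body

        browser.close()

    return captured
```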
matteocargnelutti commented 1 year ago

Re: Using --exclude / --blockRules to limit what the crawler can see

Allow list of protocols: good defaults ✅

By default, browsertrix only accepts http:// and https:// URLs. This greatly reduces the risk of accidentally capturing file:// or chrome:// URLs, for example.

Invalid Seed "chrome://settings" - URL must start with http:// or https://
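
If we end up validating on our side as well, before a URL is ever handed to the crawler, the equivalent check is small; a sketch (purely illustrative, browsertrix does not need this from us):

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_scheme(url: str) -> bool:
    """Mirror browsertrix's default: only accept http(s) URLs as capture targets."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES


# is_allowed_scheme("chrome://settings")   -> False
# is_allowed_scheme("https://example.com") -> True
```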

Deny list for the main URL to capture: need for external mechanism ❓

The --exclude param seems to be designed more as a way to exclude certain URLs / paths "down the road", in a multi-page crawling scenario. I am not sure it matches our use case, and I think it would make sense to implement our own filtering at the API level, before adding a URL to the queue, for example to exclude certain IP ranges.
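
A rough sketch of what that API-level filter could look like, resolving the hostname and rejecting private / loopback / reserved ranges before the URL reaches the queue (the exact ranges to block are still to be decided; this only checks the initial URL, so redirects would still need the container-level enforcement mentioned above):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_capturable(url: str) -> bool:
    """Reject URLs whose host resolves to a private, loopback, or reserved address."""
    hostname = urlparse(url).hostname
    if not hostname:
        return False
    try:
        # Resolve every address the hostname points to (A and AAAA records).
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True
```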


Deny list for URLs used in <iframe>: option available ☑️

--blockRules works as intended for this use case. I was able to use it to prevent the crawler from capturing the content of an <iframe> which was pointing to a domain only accessible from the network the crawler is on.

The slight downside is that we'll likely spend quite some time devising and testing these regular expressions.
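
To keep that from becoming guesswork, the candidate patterns could live next to a small table-driven test; a sketch with placeholder patterns and URLs (the exact matching semantics of --blockRules should be double-checked against the browsertrix docs):

```python
import re

# Hypothetical block rule: anything under an internal-only domain.
BLOCK_PATTERNS = [
    re.compile(r"^https?://([^/]+\.)?internal\.example\.org/"),
]

SHOULD_BLOCK = [
    "https://intranet.internal.example.org/dashboard",
    "http://internal.example.org/",
]
SHOULD_ALLOW = [
    "https://example.org/",
    "https://internal.example.org.evil.com/",  # look-alike domain must not match
]


def is_blocked(url: str) -> bool:
    return any(pattern.search(url) for pattern in BLOCK_PATTERNS)


def test_block_rules():
    assert all(is_blocked(url) for url in SHOULD_BLOCK)
    assert not any(is_blocked(url) for url in SHOULD_ALLOW)
```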


rebeccacremona commented 1 year ago

Allow list of protocols: good defaults ✅

By default, browsertrix only accepts http:// and https:// URLs. This greatly reduces the risk of accidentally capturing file:// or chrome:// URLs, for example.

(Belatedly capturing a discussion from last week or the week before)

Potentially to be discussed: what happens if you pass in a URL with embedded basic auth credentials. I think the Pydantic validator is fine with that, but am not sure... Perma presently forbids that for target URLs; I don't know to what extent we are committed to that decision, or whether it would need to be enforced at this level.
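
If we do decide to enforce that restriction at this level, the check itself is small regardless of what the Pydantic validator accepts; a sketch (where it would sit in the validation pipeline is TBD):

```python
from urllib.parse import urlparse


def has_embedded_credentials(url: str) -> bool:
    """True if the URL carries userinfo, e.g. https://user:pass@example.com/."""
    parsed = urlparse(url)
    return bool(parsed.username or parsed.password)


# has_embedded_credentials("https://user:secret@example.com/") -> True
# has_embedded_credentials("https://example.com/")             -> False
```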