High-fidelity, browser-based, single-page web archiving library and CLI.
Use it in the terminal...
scoop "https://lil.law.harvard.edu"
... or in your Node.js project
import { Scoop } from '@harvard-lil/scoop'
const capture = await Scoop.capture('https://lil.law.harvard.edu')
const wacz = await capture.toWACZ()
Scoop, from the Harvard Library Innovation Lab, is a high-fidelity, browser-based web archiving capture engine for witnessing the web.
Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.
With extensive options for asset formats and inclusions, Scoop will create .warc, .warc.gz or .wacz files to be stored by users and replayed using the web archive replay software of their choosing.
Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.
More info:
.warc, .warc.gz and .wacz output formats
Scoop requires Node.js 18+.
Other recommended system-level dependencies:
curl, python3 (for the --capture-video-as-attachment option).
While the amount of resources Scoop needs depends entirely on what is being captured, at least 4 GB of RAM is recommended for complex captures.
This program has been written for UNIX-like systems and is expected to work on Linux, macOS, and Windows Subsystem for Linux.
Scoop is available on npmjs.org and can be installed as follows:
# As a CLI
npm install -g @harvard-lil/scoop
# As a library
npm install @harvard-lil/scoop --save
# In both cases, you may need to install Playwright's dependencies:
sudo npx playwright install-deps chromium
Here are a few examples of how the scoop command can be used to make a customized capture of a web page.
# This will capture a given url using the default settings.
scoop "https://lil.law.harvard.edu"
# Unless specified otherwise, scoop will save the output of the capture as "./archive.wacz".
# We can change this with the `--output` / `-o` option
scoop "https://lil.law.harvard.edu" -o my-collection/lil.wacz
# But what if I want to change the output format itself?
scoop "https://lil.law.harvard.edu" -f warc -o my-collection/lil.warc
# By default, Scoop runs in headless mode.
# I can turn the "headless" flag off to see what happens in Chromium during capture.
scoop "https://lil.law.harvard.edu" --headless false
# Although it comes with "good defaults", scoop is highly configurable ...
scoop "https://lil.law.harvard.edu" --capture-video-as-attachment false --screenshot false --capture-window-x 320 --capture-window-y 480 --capture-timeout 30000 --max-capture-size 100000 --signing-url "https://example.com/sign"
# ... use --help to list the available options, and see what the defaults are.
scoop --help
# Timeout-related options are good dials to turn first when trying to customize "how much" of a page to capture.
scoop "https://lil.law.harvard.edu" --capture-timeout 90000 --load-timeout 60000 --network-idle-timeout 30000
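When the CLI is driven from a script, it can help to build the argument list from an options object. The following is a minimal sketch, not part of Scoop: buildScoopArgs is a hypothetical helper, and it assumes that each CLI flag is the kebab-case form of the corresponding camelCase option name, as the examples in this document suggest (captureTimeout and --capture-timeout). Check scoop --help before relying on that mapping for a given option.

```javascript
// Hypothetical helper (not part of Scoop): turns an options object into an
// argument list for the `scoop` CLI.
// Assumption: each flag is the kebab-case form of the camelCase option name.
function buildScoopArgs (url, options = {}) {
  const args = [url]
  for (const [key, value] of Object.entries(options)) {
    // captureTimeout -> --capture-timeout
    const flag = '--' + key.replace(/[A-Z]/g, (c) => '-' + c.toLowerCase())
    args.push(flag, String(value))
  }
  return args
}

const args = buildScoopArgs('https://lil.law.harvard.edu', {
  captureTimeout: 90000,
  loadTimeout: 60000,
  networkIdleTimeout: 30000
})

// Prints the equivalent command line.
console.log(['scoop', ...args].join(' '))
```

The resulting array can be passed to a process-spawning function of your choice; keeping the arguments as an array avoids shell-quoting issues with URLs.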
Scoop can be used as a library in a Node.js project.
Here are a few examples of how to programmatically capture web pages using the Scoop.capture() method, which returns an instance of the Scoop class.
const capture = await Scoop.capture(url, options)
Scoop.capture() method
Scoop.toWACZ() method
Scoop.toWARC() method
Scoop.fromWACZ() method (experimental)
Scoop.state property
import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'
try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu')
  const wacz = await capture.toWACZ()
  await fs.writeFile('archive.wacz', Buffer.from(wacz))
} catch (err) {
  // ...
}
import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'
try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu', {
    screenshot: true,
    pdfSnapshot: true,
    captureVideoAsAttachment: false,
    captureTimeout: 120 * 1000,
    loadTimeout: 60 * 1000,
    captureWindowX: 320,
    captureWindowY: 480
  })
  const warc = await capture.toWARC()
  await fs.writeFile('archive.warc', Buffer.from(warc))
} catch (err) {
  // ...
}
import { Scoop } from '@harvard-lil/scoop'
try {
  // "options" will be a copy of Scoop's default settings
  const options = Scoop.defaults
  // It therefore becomes easier to inspect said defaults ...
  console.log(options)
  // ... and edit existing values
  options.pdfSnapshot = true
  options.blocklist.push('/https?:\/\/foo/')
  // Scoop.capture() returns a promise, so it must be awaited
  const capture = await Scoop.capture('https://lil.law.harvard.edu', options)
  // ...
} catch (err) {
  // ...
}
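When the same overrides are reused across several captures, one option is to merge them into a fresh copy of the defaults each time. The sketch below is illustrative only: withOverrides is a hypothetical helper, and exampleDefaults is a small stand-in object whose values are made up for the example; in a real project the first argument would be Scoop.defaults.

```javascript
// Stand-in for Scoop.defaults. The real object has many more keys, and the
// values below are illustrative assumptions, not Scoop's actual defaults.
const exampleDefaults = {
  pdfSnapshot: false,
  screenshot: true,
  blocklist: ['/https?:\\/\\/localhost/']
}

// Hypothetical helper (not part of Scoop): returns a new options object and
// leaves `defaults` untouched. Arrays such as `blocklist` are concatenated
// rather than replaced.
function withOverrides (defaults, overrides) {
  const merged = structuredClone(defaults) // available in Node.js 17+
  for (const [key, value] of Object.entries(overrides)) {
    merged[key] = Array.isArray(merged[key]) && Array.isArray(value)
      ? [...merged[key], ...value]
      : value
  }
  return merged
}

const options = withOverrides(exampleDefaults, {
  pdfSnapshot: true,
  blocklist: ['/https?:\\/\\/foo/']
})

console.log(options)
```

The merged object can then be passed as the second argument of Scoop.capture().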
import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'
try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu')
  const signedWacz = await capture.toWACZ(true, {
    url: 'https://example.com/sign',
    token: 'some-very-secret-token'
  })
  await fs.writeFile('archive.wacz', Buffer.from(signedWacz))
} catch (err) {
  // ...
}
🚧 Under construction
Browser-based capture means that Scoop uses a browser - Chromium - to visit the web page to capture and collect resources.
Specifically, it uses an HTTP proxy to "intercept" network exchanges as early as possible and preserve them "as is".
flowchart LR
A[Scoop]
B[Playwright]
C[Chromium]
D[Website]
E[HTTP Proxy]
A <--> |Controls| B
B <--> C
C <--> D
A <-.-> |Capture| E <-.-> C
The browser Scoop controls was installed specifically for programmatic access by Playwright, the underlying tool it uses to communicate with it, and is different from the default browser of the machine Scoop is running on. Additionally, Scoop creates a single-use, isolated browsing context for every capture it makes.
Not yet - for security reasons - but we're working on it.
Although Playwright supports loading browser profiles, doing so:
Help us design this feature: https://github.com/harvard-lil/scoop/issues/118
Yes, and unless specified otherwise.
Namely:
Exchanges captured in that context still go through Scoop's HTTP proxy, with the exception of crip.
flowchart LR
A[Scoop]
B[curl]
C[Resource]
D[HTTP Proxy]
A <--> |Controls| B
B <--> C
A <-.-> |Capture| D <-.-> B
The includeRaw option of Scoop.toWACZ() allows for adding a folder named "raw" in the WACZ file, which contains a copy of unprocessed HTTP exchanges coming directly from Scoop's HTTP proxy.
This feature may be used to preserve finer elements that would otherwise be lost, such as ill-formed HTTP headers, and could be relevant in certain contexts such as forensic analysis.
In order to prevent unnecessary use of storage, Scoop only keeps in "/raw" the contents of exchanges it assesses are presented differently in WARCs. In practice, this most often means the bodies of HTTP exchanges are not included in the "/raw" files because the WARCs already contain the same data.
Experimental: WACZ files stored with the includeRaw option can be ingested by Scoop for analysis and processing via the Scoop.fromWACZ() method.
In certain cases, running Scoop in "headful" mode might yield better results.
Passing --headless false to the CLI or { headless: false } to the library will instruct Scoop to run Chromium in headful mode.
Simulating a graphical output is necessary when running Scoop in headful mode on a server. The following command can be used for that purpose:
xvfb-run --auto-servernum -- scoop "https://lil.law.harvard.edu" --headless false
This codebase uses the Standard JS coding style.
npm run lint can be used to check formatting.
npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
JSDoc is used for both documentation and loose type checking purposes on this project.
This project uses Node.js' built-in test runner.
npm run test
The following environment variables allow for testing features requiring access to a third-party server.
These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.
Name | Description |
---|---|
TEST_WACZ_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
TEST_WACZ_SIGNING_TOKEN | If required by the server at TEST_WACZ_SIGNING_URL, an authentication token. |
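As an illustration of how these two variables might be consumed, here is a minimal sketch that turns them into a signing options object of the same shape as the one passed to toWACZ() in the signing example of this document ({ url, token }). signingOptionsFromEnv is a hypothetical helper, not something Scoop or its test suite is known to provide.

```javascript
// Hypothetical helper (not part of Scoop): builds signing options from an
// environment-like object, or returns null when no signing URL is set.
function signingOptionsFromEnv (env) {
  const url = env.TEST_WACZ_SIGNING_URL
  if (!url) return null

  const options = { url }
  // The token is only needed if the signing server requires one.
  if (env.TEST_WACZ_SIGNING_TOKEN) {
    options.token = env.TEST_WACZ_SIGNING_TOKEN
  }
  return options
}

const signing = signingOptionsFromEnv(process.env)
console.log(signing ? 'Signing tests enabled' : 'Signing tests skipped')
```

Returning null when the URL is absent lets a caller skip signing-related checks on machines that have no access to a signing server.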
# Runs test suite
npm run test
# Runs linter
npm run lint
# Runs linter and attempts to automatically fix issues
npm run lint-autofix
# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer
# Step-by-step NPM publishing helper
npm run publish-util