harvard-lil / scoop

🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.
MIT License

Scoop 🍨


High-fidelity, browser-based, single-page web archiving library and CLI.

Use it in the terminal...

scoop "https://lil.law.harvard.edu"

... or in your Node.js project

import { Scoop } from '@harvard-lil/scoop'

const capture = await Scoop.capture('https://lil.law.harvard.edu')
const wacz = await capture.toWACZ()



Summary


About

Scoop is a high-fidelity, browser-based web archiving capture engine for witnessing the web, built by the Harvard Library Innovation Lab.

Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.

With extensive options for asset formats and inclusions, Scoop will create .warc, .warc.gz or .wacz files to be stored by users and replayed using the web archive replay software of their choosing.

Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.

More info:

👆 Back to the summary


Main Features

Examples and screenshots

👆 Back to the summary


Getting started

Dependencies and requirements

Scoop requires Node.js 18+.

Other recommended system-level dependencies: curl, python3 (for --capture-video-as-attachment option).

While the amount of resources Scoop needs depends entirely on what is being captured, a minimum of 4GB of RAM appears to be needed for complex captures.

Compatibility

This program has been written for UNIX-like systems and is expected to work on Linux, macOS, and Windows Subsystem for Linux.

Installation

Scoop is available on npmjs.org and can be installed as follows:

# As a CLI
npm install -g @harvard-lil/scoop

# As a library
npm install @harvard-lil/scoop --save

# In both cases, you may need to install Playwright's dependencies: 
sudo npx playwright install-deps chromium

Trouble installing the CLI?

- Make sure you are running Node.js 18+ (`node -v`).
- Permissions issues are common when installing `npm` packages globally for the first time. See [npm's documentation](https://docs.npmjs.com/resolving-eacces-permissions-errors-when-installing-packages-globally) for solutions.
- On certain systems, using `install-deps` without the `chromium` argument might be necessary:
  ```bash
  sudo npx playwright install-deps
  ```
- [npx may be used](https://docs.npmjs.com/cli/v9/commands/npx) as an alternative to a global installation:
  ```bash
  # In a new folder
  npm init
  npm install @harvard-lil/scoop
  npx scoop "https://example.com"
  ```

👆 Back to the summary


Using Scoop on the command line

Here are a few examples of how the scoop command can be used to make a customized capture of a web page.

# This will capture a given url using the default settings.
scoop "https://lil.law.harvard.edu" 

# Unless specified otherwise, scoop will save the output of the capture as "./archive.wacz".
# We can change this with the `--output` / `-o` option
scoop "https://lil.law.harvard.edu" -o my-collection/lil.wacz

# But what if I want to change the output format itself?
scoop "https://lil.law.harvard.edu" -f warc -o my-collection/lil.warc

# By default, Scoop runs in headless mode. 
# I can turn the "headless" flag off to see what happens in Chromium during capture.
scoop "https://lil.law.harvard.edu" --headless false

# Although it comes with "good defaults", scoop is highly configurable ...
scoop "https://lil.law.harvard.edu" --capture-video-as-attachment false --screenshot false --capture-window-x 320 --capture-window-y 480 --capture-timeout 30000 --max-capture-size 100000 --signing-url "https://example.com/sign"

# ... use --help to list the available options, and see what the defaults are.
scoop --help

# Timeout-related options are good dials to turn first when trying to customize "how much" of a page to capture.
scoop "https://lil.law.harvard.edu" --capture-timeout 90000 --load-timeout 60000 --network-idle-timeout 30000

See: Output of scoop --help 🔍

```
Usage: scoop [options]

🍨 High-fidelity, browser-based, single-page web archiving library and CLI.
More info: https://github.com/harvard-lil/scoop

Options:
  -v, --version                                 Display Scoop and Scoop CLI version.
  -o, --output                                  Output path. (default: "./archive.wacz")
  -f, --format                                  Output format. (choices: "warc", "warc-gzipped", "wacz", "wacz-with-raw", default: "wacz")
  --json-summary-output                         If set, allows for saving a capture summary as JSON. Must be a path to .json file.
  --export-attachments-output                   If set, allows for exporting attachments (screenshot, certs, ...). Must be a path to an existing directory.
  --signing-url                                 Authsign-compatible endpoint for signing WACZ file.
  --signing-token                               Authentication token to --signing-url, if needed.
  --screenshot                                  Add screenshot step to capture? (choices: "true", "false", default: "true")
  --pdf-snapshot                                Add PDF snapshot step to capture? (choices: "true", "false", default: "false")
  --dom-snapshot                                Add DOM snapshot step to capture? (choices: "true", "false", default: "false")
  --capture-video-as-attachment                 Add capture video(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
  --capture-certificates-as-attachment          Add capture certificate(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
  --provenance-summary                          Add provenance summary to capture? (choices: "true", "false", default: "true")
  --attachments-bypass-limits                   If active, attachments will not count towards time and size constraints imposed on capture (--capture-timeout, --max--capture-size). (choices: "true", "false", default: "true")
  --capture-timeout                             Maximum time allocated to capture process before hard cut-off, in ms. (default: 60000)
  --load-timeout                                Max time Scoop will wait for the page to load, in ms. (default: 20000)
  --network-idle-timeout                        Max time Scoop will wait for the in-browser networking tasks to complete, in ms. (default: 20000)
  --behaviors-timeout                           Max time Scoop will wait for the browser behaviors to complete, in ms. (default: 20000)
  --capture-video-as-attachment-timeout         Max time Scoop will wait for the video capture process to complete, in ms. (default: 30000)
  --capture-certificates-as-attachment-timeout  Max time Scoop will wait for the certificates capture process to complete, in ms. (default: 10000)
  --capture-window-x                            Width of the browser window Scoop will open to capture, in pixels. (default: 1600)
  --capture-window-y                            Height of the browser window Scoop will open to capture, in pixels. (default: 900)
  --max-capture-size                            Size limit for the capture's exchanges list, in bytes. (default: 209715200)
  --auto-scroll                                 Should Scoop try to scroll through the page? (choices: "true", "false", default: "true")
  --auto-play-media                             Should Scoop try to autoplay `
```
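
When scripting many captures, it can help to assemble the command line from a plain object. The following is a minimal sketch, not part of Scoop: `toFlag` and `buildScoopArgs` are hypothetical helpers, and they assume that every option is written as a kebab-case flag followed by a value, as in the `--help` output above.

```javascript
// Hypothetical helper (not part of Scoop): converts a camelCase key into
// a kebab-case CLI flag, e.g. captureWindowX -> --capture-window-x.
function toFlag (key) {
  return '--' + key.replace(/[A-Z]/g, (c) => '-' + c.toLowerCase())
}

// Hypothetical helper (not part of Scoop): builds the argument list for
// the `scoop` command. Assumes each option takes exactly one value.
function buildScoopArgs (url, options = {}) {
  const args = [url]
  for (const [key, value] of Object.entries(options)) {
    // Boolean options are passed as the strings "true" / "false".
    args.push(toFlag(key), String(value))
  }
  return args
}

const args = buildScoopArgs('https://lil.law.harvard.edu', {
  format: 'warc-gzipped',
  output: 'my-collection/lil.warc.gz',
  screenshot: false,
  captureWindowX: 320,
  captureWindowY: 480,
  captureTimeout: 90000
})

console.log(['scoop', ...args].join(' '))
```

The resulting array can be handed to Node.js' `child_process.execFile('scoop', args)`, which avoids shell quoting issues with URLs.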

👆 Back to the summary


Using Scoop as a JavaScript library

Scoop can be used as a library in a Node.js project. Here are a few examples of how to programmatically capture web pages using the Scoop.capture() method, which returns an instance of the Scoop class.

const capture = await Scoop.capture(url, options)

Quick access

Example: Capture with default settings

import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'

try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu')
  const wacz = await capture.toWACZ()
  await fs.writeFile('archive.wacz', Buffer.from(wacz))
} catch(err) {
  // ...
}

Example: Capture with custom settings

import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'

try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu', {
    screenshot: true,
    pdfSnapshot: true,
    captureVideoAsAttachment: false,
    captureTimeout: 120 * 1000,
    loadTimeout: 60 * 1000,
    captureWindowX: 320,
    captureWindowY: 480
  })

  const warc = await capture.toWARC()
  await fs.writeFile('archive.warc', Buffer.from(warc))
} catch(err) {
  // ...
}
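
When adjusting timeouts, it may be useful to check how the per-step timeouts compare to the overall `captureTimeout`. The sketch below is an illustration only: `checkTimeoutBudget` is a hypothetical helper, the default values are copied from `scoop --help`, `networkIdleTimeout` and `behaviorsTimeout` are assumed to be the library-side names of `--network-idle-timeout` and `--behaviors-timeout`, and the steps are assumed to run one after another.

```javascript
// Default values as listed by `scoop --help`, in milliseconds.
const helpDefaults = {
  captureTimeout: 60000,
  loadTimeout: 20000,
  networkIdleTimeout: 20000,
  behaviorsTimeout: 20000
}

// Hypothetical helper (not part of Scoop). Assumes the load, network idle
// and behaviors steps run sequentially, so that their timeouts add up.
function checkTimeoutBudget (overrides = {}) {
  const o = { ...helpDefaults, ...overrides }
  const steps = o.loadTimeout + o.networkIdleTimeout + o.behaviorsTimeout
  return { steps, captureTimeout: o.captureTimeout, fits: steps <= o.captureTimeout }
}

// Defaults only
console.log(checkTimeoutBudget())

// Values from the "custom settings" example
console.log(checkTimeoutBudget({ captureTimeout: 120 * 1000, loadTimeout: 60 * 1000 }))
```

A result where `fits` is `false` is not an error: under the assumptions above, it only suggests that the hard cut-off could be reached before the later steps have used their full allowance.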

Example: Working with a copy of default settings

import { Scoop } from '@harvard-lil/scoop'

try {
  // "options" will be a copy of Scoop's default settings
  const options = Scoop.defaults

  // It therefore becomes easier to inspect said defaults ...
  console.log(options)

  // ... and edit existing values
  options.pdfSnapshot = true
  options.blocklist.push('/https?:\/\/foo/')

  const capture = await Scoop.capture('https://lil.law.harvard.edu', options)

  // ...
} catch(err) {
  // ...
}

Example: Using a signing server

import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'

try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu')

  const signedWacz = await capture.toWACZ(true, {
    url: 'https://example.com/sign',
    token: 'some-very-secret-token'
  })

  await fs.writeFile('archive.wacz', Buffer.from(signedWacz))
} catch(err) {
  // ...
}

👆 Back to the summary


FAQ

🚧 Under construction

What does "browser-based" capture mean? Is it using my browser?

Browser-based capture means that Scoop uses a browser - Chromium - to visit the web page to capture and collect resources.

Specifically, it uses an HTTP proxy to "intercept" network exchanges as early as possible and preserve them "as is".

flowchart LR
    A[Scoop]
    B[Playwright]
    C[Chromium]
    D[Website]
    E[HTTP Proxy]
    A <--> |Controls| B
    B <--> C
    C <--> D
    A <-.-> |Capture| E <-.-> C

The browser Scoop controls was installed specifically for programmatic access by Playwright, the underlying tool it uses to communicate with it, and is different from the default browser of the machine Scoop is running on. Additionally, Scoop creates a single-use, isolated browsing context for every capture it makes.

More info:

Can I capture content behind login / password with Scoop?

Not yet - for security reasons - but we're working on it.

Although Playwright supports loading browser profiles, doing so:

Help us design this feature: https://github.com/harvard-lil/scoop/issues/118

Does Scoop capture everything through a browser?

Yes, unless specified otherwise.

Namely:

Exchanges captured in that context still go through Scoop's HTTP proxy, with the exception of crip.

flowchart LR
    A[Scoop]
    B[curl]
    C[Resource]
    D[HTTP Proxy]
    A <--> |Controls| B
    B <--> C
    A <-.-> |Capture| D <-.-> B

What is "WACZ with RAW exchanges"?

The includeRaw option of Scoop.toWACZ() allows for adding a folder named "raw" in the WACZ file, which contains a copy of unprocessed HTTP exchanges coming directly from Scoop's HTTP proxy.

This feature may be used to preserve finer elements that would otherwise be lost, such as ill-formed HTTP headers, and could be relevant in certain contexts such as forensic analysis.

In order to prevent unnecessary use of storage, Scoop only keeps in "/raw" the contents of exchanges it assesses are presented differently in WARCs. In practice, this most often means the bodies of HTTP exchanges are not included in the "/raw" files because the WARCs already contain the same data.

Experimental: WACZ files stored with the includeRaw option can be ingested by Scoop for analysis and processing via the Scoop.fromWACZ() method.

Should I run Scoop in headful mode?

In certain cases, running Scoop in "headful" mode might yield better results.

Passing --headless false to the CLI or { headless: false } to the library will instruct Scoop to run Chromium in headful mode.

Simulating a graphical output is necessary when running Scoop in headful mode on a server. The following command can be used for that purpose:

xvfb-run --auto-servernum -- scoop "https://lil.law.harvard.edu" --headless false
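
For scripts that run both on desktops and on servers, the choice between the plain and the `xvfb-run` form of the command can be made at runtime. This is a minimal sketch: `headfulCommand` is a hypothetical helper, and it assumes a Linux host where a missing `DISPLAY` environment variable means no graphical output is available.

```javascript
// Hypothetical helper (not part of Scoop): builds the command line for a
// headful capture, wrapping it in xvfb-run when no display is available.
// Assumption: an unset DISPLAY variable means there is no graphical output.
function headfulCommand (url, env = process.env) {
  const scoop = ['scoop', url, '--headless', 'false']
  return env.DISPLAY ? scoop : ['xvfb-run', '--auto-servernum', '--', ...scoop]
}

// On a server without a display
console.log(headfulCommand('https://lil.law.harvard.edu', {}).join(' '))

// On a machine with a display
console.log(headfulCommand('https://lil.law.harvard.edu', { DISPLAY: ':0' }).join(' '))
```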

👆 Back to the summary


Development

Standard JS

This codebase uses the Standard JS coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.

Testing

This project uses Node.js' built-in test runner.

npm run test

Tests-specific environment variables

The following environment variables allow for testing features requiring access to a third-party server.

These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.

| Name | Description |
| --- | --- |
| TEST_WACZ_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
| TEST_WACZ_SIGNING_TOKEN | If required by the server at TEST_WACZ_SIGNING_URL, an authentication token. |
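
As an illustration of how these two variables might be consumed, here is a minimal sketch. `signingServerFromEnv` is a hypothetical helper, not part of Scoop or of its test suite; it only assumes the variables have already been loaded into the environment.

```javascript
// Hypothetical helper (not part of Scoop): reads the two variables described
// above and returns a { url, token } object, or null when no URL is set.
function signingServerFromEnv (env = process.env) {
  const url = env.TEST_WACZ_SIGNING_URL
  if (!url) return null

  const server = { url }
  // The token is optional: only include it when the server requires one.
  if (env.TEST_WACZ_SIGNING_TOKEN) server.token = env.TEST_WACZ_SIGNING_TOKEN
  return server
}

console.log(signingServerFromEnv({ TEST_WACZ_SIGNING_URL: 'http://localhost:5000/sign' }))
console.log(signingServerFromEnv({}))
```

The returned object has the same shape as the second argument passed to `capture.toWACZ()` in the "Using a signing server" example.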

Available CLI

# Runs test suite
npm run test

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer

# Step-by-step NPM publishing helper
npm run publish-util

👆 Back to the summary