iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WIP: Document practices that may benefit from standardisation #54

Open anjackson opened 5 years ago

anjackson commented 5 years ago

We have a few cases where different tools are implementing shared use cases in slightly different WARC record structures. The purpose of this issue is to collect information on these variations so we can at least document their usage and prevent any further unnecessary variation. Understanding current usage should also set the stage for standardisation.

Crawl-time rendering artefacts

A number of organisations are now running web browsers during the crawl, and this provides an opportunity to preserve more information about how a site looked at the time it was captured.

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| resource | application/pdf, text/html, image/png | urn:X-wpull:snapshot?url=<ENCODED_URL> | wpull (also stores a WARC-Concurrent-To pointer to a snapshot action metadata record) |
| resource | image/jpeg | screenshot:<CANONICAL_URL> | Brozzler code |
| resource | image/jpeg | thumbnail:<CANONICAL_URL> | Brozzler code |
| resource | image/jpeg | screenshot:<URL> | UKWA code |
| resource | application/pdf | pdf:<URL> | UKWA code |
| resource | image/jpeg | thumbnail:<URL> | UKWA code |
| resource | text/html; charset="utf-8" | imagemap:<URL> | UKWA code |
| resource | application/json | har:<URL> | UKWA code |
| resource | text/html | onreadydom:<URL> | UKWA code |
| resource | image/png | urn:view:<URL> | browsertrix-crawler |
| resource | image/png | urn:fullPage:<URL> | browsertrix-crawler |
| resource | image/jpeg | urn:thumbnail:<URL> | browsertrix-crawler |
| conversion | text/html; charset="utf-8" | <URL> | crocoite* |
| conversion | image/png | <URL> | crocoite* |
| | | | UMBRA? |

*Note that crocoite uses additional record headers to indicate the type of the conversion record, e.g. 'X-Crocoite-Type': 'dom-snapshot'.
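
To make the variation concrete, here is a rough Python sketch (not taken from any of the tools listed) of how the same page screenshot ends up under a different WARC-Target-URI depending on the tool. The patterns are copied from the table above; the function name, the dictionary keys, and the use of full percent-encoding for <ENCODED_URL> are assumptions for illustration only, and no URL canonicalisation is attempted for the Brozzler case.

```python
from urllib.parse import quote

# Target-URI patterns for page screenshots, copied from the table above.
# The dictionary keys are informal labels, not identifiers used by the tools.
SCREENSHOT_URI_PATTERNS = {
    "wpull": "urn:X-wpull:snapshot?url={encoded_url}",
    "brozzler": "screenshot:{url}",  # the table says <CANONICAL_URL>; not canonicalised here
    "ukwa": "screenshot:{url}",
    "browsertrix-crawler": "urn:view:{url}",
}


def screenshot_target_uri(tool: str, url: str) -> str:
    """Return the WARC-Target-URI a given tool would use for a screenshot of `url`.

    Hypothetical helper: the percent-encoding used for wpull is an assumption.
    """
    pattern = SCREENSHOT_URI_PATTERNS[tool]
    return pattern.format(url=url, encoded_url=quote(url, safe=""))


for tool in SCREENSHOT_URI_PATTERNS:
    print(tool, screenshot_target_uri(tool, "http://example.org/"))
```

crocoite is left out of the sketch because it stores the original <URL> in a conversion record rather than minting a new URI scheme.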

Web A/V Capture

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| metadata | application/vnd.youtube-dl_formats+json | metadata://<AUTHORITY_AND_RESOURCE> | wpull, Heritrix3 ExtractorYoutubeDL module, Old Webrecorder |
| metadata | application/vnd.youtube-dl_formats+json;charset=utf-8 | youtube-dl:<CANONICAL_URL> | Brozzler code |
| resource | as found | youtube-dl:<PLAYLIST_INDEX>:<WEBPAGE_URL> | Brozzler code |
| | | | Webrecorder |
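
One consequence for anyone reading these records back: the two Brozzler patterns share the youtube-dl: prefix, so a consumer has to tell them apart by shape. The sketch below is one possible way to do that in Python. It assumes <PLAYLIST_INDEX> is a run of decimal digits, which the table does not actually state, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches both youtube-dl:<CANONICAL_URL> and
# youtube-dl:<PLAYLIST_INDEX>:<WEBPAGE_URL>.
# Assumption: the playlist index, when present, consists only of digits.
_YOUTUBE_DL_URI = re.compile(r"^youtube-dl:(?:(?P<index>\d+):)?(?P<url>.+)$")


def parse_youtube_dl_uri(target_uri: str) -> Optional[Tuple[Optional[int], str]]:
    """Split a youtube-dl: Target-URI into (playlist_index, url).

    Returns None if the URI does not use the youtube-dl: prefix, and a
    playlist_index of None when no index segment is present.
    """
    match = _YOUTUBE_DL_URI.match(target_uri)
    if match is None:
        return None
    index = match.group("index")
    return (int(index) if index is not None else None, match.group("url"))
```

For example, parse_youtube_dl_uri("youtube-dl:3:https://example.org/watch") gives an index of 3, while the same URI without the index segment gives None for the index.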

Crawl Logs

At UKWA we consider our crawl logs to be important artefacts, but we don't put them in WARC. Maybe we should?

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| resource | text/plain | urn:X-wpull:log | wpull |
| metadata | application/json | ? | crocoite |
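
For anyone weighing up whether to store logs in WARC, a resource record along the lines of the wpull row is small. The following is a minimal sketch using only the Python standard library, assuming WARC/1.1 framing and only the mandatory named fields plus the target URI and content type. It is not how wpull itself writes the record, the function name is hypothetical, and in practice a WARC library would normally handle this (including digests and compression).

```python
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, content_type: str, payload: bytes) -> bytes:
    """Serialise a single uncompressed WARC/1.1 resource record (illustrative only)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, named fields, then a blank line before the record block.
    head = "WARC/1.1\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers) + "\r\n"
    # Every record ends with two CRLF pairs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Example: a crawl log stored under the same Target-URI that wpull uses.
record = build_resource_record("urn:X-wpull:log", "text/plain", b"example log line\n")
```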

EDIT 2023-10-18: Updated with notes from comments.

PromyLOPh commented 5 years ago

For reference, crocoite uses conversion records to store screenshots and DOM snapshots, and metadata records to store log entries; see https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L216, https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L201 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L236

tw4l commented 1 year ago

Updating with Webrecorder's current practices for screenshots:

Crawl-time rendering artefacts

| WARC-Type | Content-Type | WARC-Target-URI | Tool |
| --- | --- | --- | --- |
| resource | image/png | urn:view:<URL> | browsertrix-crawler |
| resource | image/png | urn:fullPage:<URL> | browsertrix-crawler |
| resource | image/jpeg | urn:thumbnail:<URL> | browsertrix-crawler |

anjackson commented 10 months ago

I've attempted to update this with the information you provided, @PromyLOPh @tw4l.