A Java library for reading and writing WARC files. This library includes a high level API modeling the standard record types as individual classes with typed accessors. The API is exensible and you can register extension record types and accessors for extension header fields.
try (WarcReader reader = new WarcReader(FileChannel.open(Paths.get("example.warc")))) {
for (WarcRecord record : reader) {
if (record instanceof WarcResponse && record.contentType().base().equals(MediaType.HTTP)) {
WarcResponse response = (WarcResponse) record;
System.out.println(response.http().status() + " " + response.target());
}
}
}
It uses a finite state machine parser generated from a strict grammar using Ragel. There is an optional lenient mode which can handle some forms of non-compliant WARC records. ARC and HTTP parsing is lenient by default.
Gzipped records are automatically decompressed. The parser interprets ARC/1.1 record as if they are a WARC dialect and populates the appropriate WARC headers.
All I/O is performed using NIO and an an effort is made to minimize data copies and share buffers whenever feasible. Direct buffers and even memory-mapped files can be used, but only with uncompressed WARCS until they're supported by Inflater (coming in JDK 11).
To use as a library add jwarc as a dependency from Maven Central.
To use as a command-line tool install Java 8 or later, download the latest release jar and run it using:
java -jar jwarc-{version}.jar
If you would prefer to build it from source install JDK 8+ and Maven and then run:
mvn package
try (WarcWriter writer = new WarcWriter(System.out)) {
writer.fetch(URI.create("http://example.org/"));
}
// write a warcinfo record
// date and record id will be populated automatically if unset
writer.write(new Warcinfo.Builder()
.fields("software", "my-cool-crawler/1.0",
"robots", "obey")
.build());
// we can also supply a specific date
Instant captureDate = Instant.now();
// write a request but keep a copy of it to reference later
WarcRequest request = new WarcRequest.Builder()
.date(captureDate)
.target(uri)
.contentType("application/http")
.body(bodyStream, bodyLength)
.build();
writer.write(request);
// write a response referencing the request
WarcResponse response = new WarcResponse.Builder()
.date(captureDate)
.target(uri)
.contentType("application/http")
.body("HTTP/1.0 200 OK\r\n...".getBytes())
.concurrentTo(request.id())
.build();
writer.write(response);
The WarcFilter class provides a simple filter expression language for matching WARC records. For example here's a moderately complex filter which matches all records that are not image resources or image responses:
!((warc-type == "resource" && content-type =~ "image/.*") ||
(warc-type == "response" && http:content-type =~ "image/.*"))
WarcFilter implements Predicate<WarcRecord>
and be used to conveniently with streams of records:
long errorCount = warcReader.records().filter(WarcFilter.compile(":status >= 400")).count();
Their real power though is as a building block for user-supplied options.
jwarc also includes a set of command-lines tools which serve as examples. Note that many of the tools are lightweight demonstrations and may lack important options and features.
Capture a URL (without subresources):
java -jar jwarc.jar fetch http://example.org/ > example.warc
Create a CDX file:
java -jar jwarc.jar cdx example.warc > records.cdx
Run a replay proxy and web server:
export PORT=8080
java -jar jwarc.jar serve example.warc
Replay each page within in a WARC and use headless Chrome to render a screenshot and save it as a resource record:
export BROWSER=/opt/google/chrome/chrome
java -jar jwarc.jar screenshot example.warc > screenshots.warc
Running a proxy server which records requests and responses. This will generate self-signed SSL certificates so you will
will need turn off TLS verification in the client. For Chrome/Chromium use the --ignore-certificate-errors
command-line option.
export PORT=8080
java -jar jwarc.jar recorder > example.warc
chromium --proxy-server=http://localhost:8080 --ignore-certificate-errors
Record a command that obeys the http(s)_proxy and CURL_CA_BUNDLE environment variables:
java -jar jwarc.jar recorder -o example.warc curl http://example.org/
Capture a page by recording headless Chrome:
export BROWSER=/opt/google/chrome/chrome
java -jar jwarc.jar record > example.warc
Create a new file containing only html responses with status 200:
java -jar jwarc.jar filter ':status == 200 && http:content-type =~ "text/html(;.*)?"' example.warc > pages.warc
See the javadoc for more details.
new WarcReader(stream|path|channel); // opens a WARC file for reading
reader.close(); // closes the underlying channel
(WarcCompression) reader.compression(); // type of compression: NONE or GZIP
(Iterator) reader.iterator(); // an iterator over the records
(WarcRecord) reader.next(); // reads the next record
reader.registerType("myrecord", MyRecord::new); // registers a new record type
reader.setLenient(true); // enables lenient parsing mode
new WarcWriter(channel, NONE|GZIP); // opens a WARC file for writing
writer.fetch(uri); // downloads a resource recording the request and response
(long) writer.position(); // byte position the next record will be written to
writer.write(record); // adds a record to the WARC file
Message
HttpMessage
HttpRequest
HttpResponse
WarcRecord
Warcinfo (warcinfo)
WarcTargetRecord
WarcContinuation (continuation)
WarcConversion (conversion)
WarcCaptureRecord
WarcMetadata (metadata)
WarcRequest (request)
WarcResource (resource)
WarcResponse (response)
WarcRevisit (revisit)
The basic building block of both HTTP protocol and WARC file format is a message consisting of set of named header fields and a body. Header field names are case-insensitvie and may have multiple values.
(BodyChannel) message.body(); // the message body as a ReadableByteChannel
(long) message.body().position(); // the next byte position to read from
(int) message.body().read(byteBuffer); // reads a sequence of bytes from the body
(long) message.body().size(); // the length in bytes of the body
(InputStream) message.body().stream(); // views the body as an InputStream
(String) message.contentType(); // the media type of the body
(Headers) message.headers(); // the header fields
(List<String>) message.headers().all("Cookie"); // all values of a header
(boolean) message.headers().contains("TE", "deflate"); // tests if a value is present
(Optional<String>) message.headers().first("Cookie"); // the first value of a header
(Map<String,List<String>>) message.headers().map(); // views the header fields as a map
(Optional<String>) message.headers().sole("Location"); // throws if header has multiple values
(ProtocolVersion) message.version(); // the protocol version (e.g. HTTP/1.0 or WARC/1.1)
Methods available on all WARC records:
(Optional<Digest>) record.blockDigest(); // value of hash function applied to bytes of body
(Instant) record.date(); // instant that data capture began
(URI) record.id(); // globally unique record identifier
(Optional<Long>) record.segmentNumber(); // position of this record in segmentated series
(TuncationReason) record.truncated(); // reason record was truncated; or else NOT_TRUNCATED
(String) record.type(); // "warcinfo", "request", "response" etc
(Headers) warcinfo.fields(); // parses the body as application/warc-fields
(Optional<String>) warcinfo.filename(); // filename of the containing WARC
Methods available on all WARC records except Warcinfo:
(Optional<String>) record.identifiedPayloadType(); // media type of payload identified by an independent check
(String) record.target(); // captured URI as an unparsed string
(URI) record.targetURI(); // captured URI
(Optional<WarcPayload>) record.payload(); // payload
(Optional<Digest>) record.payloadDigest(); // value of hash function applied to bytes of the payload
(Optional<URI>) record.warcinfoID(); // ID of warcinfo record when stored separately
(String) continuation.segmentOriginId(); // record ID of first segment
(Optional<String>) continuation.segmentTotalLength(); // (last only) total length of all segments
(Optional<URI>) conversion.refersTo(); // ID of record this one was converted from
Methods available on metadata, request, resource and response records:
(List<URI>) capture.concurrentTo(); // other record IDs from the same capture event
(Optional<InetAddr>) capture.ipAddress(); // IP address of the server
(Headers) metadata.fields(); // parses the body as application/warc-fields
(HttpRequest) request.http(); // parses the body as a HTTP request
(BodyChannel) request.http().body(); // HTTP request body
(Headers) request.http().headers(); // HTTP request headers
No methods are specific to resource records. See WarcRecord, WarcTargetRecord, WarcCaptureRecord above.
(HttpResponse) response.http(); // parses the body as a HTTP response
(BodyChannel) response.http().body(); // HTTP response body
(Headers) response.http().headers(); // HTTP response headers
(HttpResponse) revisit.http(); // parses the body as a HTTP response
(Headers) revisit.http().headers(); // HTTP response headers (note: revisits never have a payload!)
(URI) revisit.profile() // revisit profile (not modified or identical payload)
(URI) revisit.refersTo(); // id of record this is a duplicate of
(URI) revisit.refersToTargetURI(); // targetURI of the referred to record
(Instant) revisit.refersToDate(); // date of the referred to record
Note: revisit records never have a payload so
Criteria | jwarc | JWAT | webarchive-commons |
---|---|---|---|
License | Apache 2 | Apache 2 | Apache 2 |
Parser based on | Ragel FSM | Hand-rolled FSM | Apache HTTP |
Push parsing | Low level | ✘ | ✘ |
Folded headers † | ✔ | ✔ | ✔ |
Encoded words † | ✘ | ✘ (disabled) | ✘ |
Validation | The basics | ✔ | ✘ |
Strict parsing ‡ | ✔ | ✘ | ✘ |
Lenient parsing | HTTP only | ✔ | ✔ |
Multi-value headers | ✔ | ✔ | ✘ |
I/O Framework | NIO | IO | IO |
Record type classes | ✔ | ✘ | ✘ |
Typed accessors | ✔ | ✔ | Some |
GZIP detection | ✔ | ✔ | Filename only |
WARC writer | Barebones | ✔ | ✔ |
ARC reader | Auto | Separate API | Factory |
ARC writer | ✘ | ✔ | ✔ |
Speed * (.warc) | 1x | ~5x slower | ~13x slower |
Speed * (.warc.gz) | 1x | ~1.4x slower | ~2.8x slower |
(†) WARC features copied from HTTP that have since been deprecated in HTTP. I'm not aware of any software that writes WARCs using these features and usage of them should probably be avoided. JWAT behaves differently from jwarc and webarchive-commons as it does not trim whitespace on folded lines.
(‡) JWAT and webarchive-commons both accept arbitrary UTF-8 characters in field names. jwarc strictly enforces the grammar rules from the WARC specification, although it does not currently enforce the rules for the values of specific individual fields.
(*) Relative time to scan records after JIT steady state. Only indicative. Need to redo this with a better benchmark. JWAT was configured with a 8192 byte buffer as with default options it is 27x slower. For comparison merely decompressing the .warc.gz file with GZIPInputStream is about 0.95x.
See also: Unaffiliated benchmark against other languages
More recent benchmarks against Java libraries