iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Leverage gzip extra field "sl" to skip over compressed WARC records #16

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

WARC writers may provide a gzip extra field "sl" (recommended by WARC 0.9 but dropped in newer versions) that encodes the length of the compressed WARC record. A reader can use it to skip quickly over the current record in tasks (e.g. CDX indexing) that do not need to read the payload. See also #14/#15.
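
To make the idea concrete, here is a minimal sketch of locating an "sl" subfield in a gzip member header. It follows the extra-field layout from RFC 1952 (FEXTRA flag, XLEN, then subfields of SI1, SI2, LEN, data). The class and method names are hypothetical and not part of jwarc. The interpretation of the subfield data (first four bytes as a little-endian unsigned compressed length) is an assumption that would need to be checked against the WARC 0.9 text and against real writers, as would the question of which offset the length is counted from.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not part of jwarc.
final class GzipSkipLength {
    private static final int FEXTRA = 0x04; // FLG bit 2 in RFC 1952

    /** Returns the data of the extra subfield (si1, si2), or null if it is absent. */
    static byte[] findExtraSubfield(byte[] member, char si1, char si2) {
        if (member.length < 12) return null; // 10-byte fixed header + 2-byte XLEN
        ByteBuffer buf = ByteBuffer.wrap(member).order(ByteOrder.LITTLE_ENDIAN);
        // Check the gzip magic bytes ID1, ID2.
        if ((buf.get(0) & 0xff) != 0x1f || (buf.get(1) & 0xff) != 0x8b) return null;
        // Without FEXTRA there is no extra field at all.
        if ((buf.get(3) & FEXTRA) == 0) return null;
        int xlen = buf.getShort(10) & 0xffff;
        int pos = 12;
        int end = Math.min(pos + xlen, member.length);
        // Walk the subfields: SI1, SI2, LEN (2 bytes, little-endian), data.
        while (pos + 4 <= end) {
            int len = buf.getShort(pos + 2) & 0xffff;
            int dataStart = pos + 4;
            if (dataStart + len > end) return null; // truncated or malformed
            if (member[pos] == (byte) si1 && member[pos + 1] == (byte) si2) {
                return Arrays.copyOfRange(member, dataStart, dataStart + len);
            }
            pos = dataStart + len;
        }
        return null;
    }

    /**
     * ASSUMPTION: the first 4 bytes of the "sl" data hold the compressed
     * record length as a little-endian unsigned 32-bit integer.
     * Returns -1 if no usable "sl" subfield is present.
     */
    static long skipLength(byte[] member) {
        byte[] data = findExtraSubfield(member, 's', 'l');
        if (data == null || data.length < 4) return -1;
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(0) & 0xffffffffL;
    }
}
```

A reader could call `skipLength` on the first bytes of each gzip member and, when the result is not -1, seek ahead instead of inflating the record, falling back to normal decompression otherwise.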