iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Tool to extract a WARC record (or its headers or payload) #41

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

Extract a WARC record given the record offset, inspired by warcio's extract tool.
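For readers unfamiliar with offset-based extraction, here is a minimal sketch of the idea using only the Java standard library. It is not the code from this pull request and does not use the jwarc API. It assumes an uncompressed WARC file, and the class and method names (`RawHeaderReader`, `readRawHeader`) are made up for illustration. It seeks to the record offset and copies the header bytes verbatim up to and including the blank line that ends the header block.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of jwarc: reads the raw header block of the
// record that starts at a given byte offset in an UNCOMPRESSED WARC file.
class RawHeaderReader {

    /**
     * Returns the header bytes (version line plus named fields) exactly as
     * stored in the file, including the terminating CRLF CRLF.
     */
    static byte[] readRawHeader(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int window = 0; // the last four bytes read, packed into one int
        int b;
        // Byte-at-a-time reads are slow but keep the sketch short.
        while ((b = file.read()) != -1) {
            out.write(b);
            window = (window << 8) | b;
            if (window == 0x0D0A0D0A) { // "\r\n\r\n" marks the end of the header
                return out.toByteArray();
            }
        }
        throw new EOFException("end of file reached before the header ended");
    }
}
```

A gzip-compressed WARC would first need the gzip member at that offset to be decompressed. Extracting the payload would additionally require reading `Content-Length` from the header and copying that many bytes after it.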

ato commented 4 years ago

Nice!

the order of headers is not preserved when they're taken from WarcRecord

We also remove excess surrounding whitespace, unfold headers, and if there are duplicate header field names that differ only in case (WARC-CONCURRENT-TO, warc-concurrent-to), only one variant is kept. Maybe the parser should keep a copy of the raw header bytes for use cases where you want to copy or display the raw header unmodified.
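To make the loss of information concrete, here is a small standalone sketch of this kind of normalization. It is not jwarc's parser. The class name `HeaderNormalizer`, the `TreeMap` container, and the choice to keep the first-seen spelling of a field name are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not jwarc code: parses a raw header block into a
// case-insensitive map, trimming whitespace and unfolding continuation lines.
class HeaderNormalizer {

    static Map<String, List<String>> parse(String rawHeader) {
        // Case-insensitive keys: names differing only in case share one entry,
        // and the spelling seen first is the one that survives.
        Map<String, List<String>> fields = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        String[] lines = rawHeader.split("\r\n");
        String name = null;
        StringBuilder value = new StringBuilder();
        for (int i = 1; i < lines.length; i++) { // index 0 is the version line
            String line = lines[i];
            if (line.isEmpty()) break; // blank line ends the header block
            if (line.charAt(0) == ' ' || line.charAt(0) == '\t') {
                // Folded continuation: join it to the current value with one space.
                value.append(' ').append(line.trim());
                continue;
            }
            store(fields, name, value);
            int colon = line.indexOf(':');
            if (colon < 0) { // not a valid field line; skip it in this sketch
                name = null;
                continue;
            }
            name = line.substring(0, colon).trim();
            value = new StringBuilder(line.substring(colon + 1).trim());
        }
        store(fields, name, value);
        return fields;
    }

    private static void store(Map<String, List<String>> fields, String name, StringBuilder value) {
        if (name != null) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
        }
    }
}
```

Once a header has gone through a step like this, the original spacing, line folding, and name casing cannot be reconstructed from the map, which is why keeping the raw bytes alongside the parsed form would help when the header must be copied or displayed unmodified.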