Closed by sebastian-nagel 4 years ago
Nice!
The order of headers is not preserved when they're taken from WarcRecord.
We also remove excess surrounding whitespace, unfold headers, and if there are duplicate header field names that differ only in case (WARC-CONCURRENT-TO, warc-concurrent-to), only one variant is kept. Maybe the parser should keep a copy of the raw header bytes for use cases where you want to copy or display the raw header unmodified.
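To make the lossy steps concrete, here is a rough Python sketch of that kind of normalization. It is not the parser's actual code: `parse_headers` is a hypothetical helper, and keeping the first-seen spelling of a field name (while collecting all values under it) is an assumption for illustration.

```python
def parse_headers(raw: bytes) -> dict:
    """Hypothetical helper: normalize a raw header block into {name: [values]}."""
    lines = []
    for line in raw.decode("utf-8").split("\r\n"):
        if not line:
            continue
        if line[0] in " \t" and lines:
            # Unfold: a line starting with whitespace continues the previous one.
            lines[-1] += " " + line.strip()
        else:
            lines.append(line)

    fields = {}
    for line in lines:
        name, _, value = line.partition(":")
        name = name.strip()
        key = name.lower()  # field names are matched case-insensitively
        if key not in fields:
            # Assumption: the first spelling seen is the variant that is kept.
            fields[key] = (name, [])
        fields[key][1].append(value.strip())  # surrounding whitespace is dropped
    return {name: values for name, values in fields.values()}


raw = (b"WARC-Type:   response  \r\n"
       b"WARC-CONCURRENT-TO: <urn:uuid:a>\r\n"
       b"warc-concurrent-to: <urn:uuid:b>\r\n"
       b"X-Folded: first\r\n"
       b"   second\r\n")
parsed = parse_headers(raw)
```

The padding, the folding, and the lowercase `warc-concurrent-to` spelling cannot be recovered from `parsed`, so anything that needs the header byte-for-byte would have to hold on to `raw` as well.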
Extract a WARC record given the record offset, inspired by warcio's extract tool.
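For reference, the general idea of offset-based extraction can be sketched in a few lines of Python. This is not the code from this change or from warcio: `extract_record` is a hypothetical name, and the sketch assumes the common layout where each record is its own gzip member, so that the offset points at the start of a member.

```python
import gzip
import os
import tempfile
import zlib


def extract_record(path: str, offset: int, chunk_size: int = 8192) -> bytes:
    """Hypothetical helper: decompress the single gzip member starting at offset."""
    decomp = zlib.decompressobj(wbits=31)  # 16 + 15 selects the gzip container
    out = []
    with open(path, "rb") as f:
        f.seek(offset)
        # Stop at the end of this member; bytes of later records are ignored.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out.append(decomp.decompress(chunk))
    return b"".join(out)


# Demo: two tiny records, each compressed as a separate gzip member.
first = gzip.compress(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n")
second = gzip.compress(b"WARC/1.0\r\nWARC-Type: response\r\n\r\n")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.warc.gz")
    with open(path, "wb") as f:
        f.write(first + second)
    record = extract_record(path, len(first))
```

Because the decompressor stops at the member boundary, only the record at the given offset is returned, headers and all, without parsing anything that precedes it.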