Open tef opened 11 years ago
Many users of the warc library need parsed HTTP headers, so it would be nice to at least have a convenience function that does this. In addition, it might be useful to have a function that streams through the payload and calculates the sha1 if the WARC-Payload-Digest header is not present.
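To make the request concrete, here is a rough sketch of what the two helpers could look like. It uses only the Python standard library. The names `parse_http_headers` and `payload_digest` are hypothetical and are not part of any existing warc library. It also assumes the record body is available as a binary file-like object, and that the digest should be formatted as `sha1:` followed by the base32-encoded hash, which is the form commonly seen in WARC-Payload-Digest.

```python
import base64
import hashlib
import io
from email.parser import BytesParser


def parse_http_headers(stream):
    """Hypothetical helper: read the first line and headers of an HTTP message.

    Leaves `stream` positioned at the first byte of the payload.
    """
    first_line = stream.readline().decode("iso-8859-1").strip()
    raw = []
    while True:
        line = stream.readline()
        # A blank line (or end of stream) terminates the header block.
        if line in (b"\r\n", b"\n", b""):
            break
        raw.append(line)
    headers = BytesParser().parsebytes(b"".join(raw), headersonly=True)
    return first_line, headers


def payload_digest(stream, chunk_size=64 * 1024):
    """Hypothetical helper: sha1 of the rest of the stream, read in chunks."""
    sha1 = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        sha1.update(chunk)
    # Assumed output format: "sha1:" + base32 of the raw digest.
    return "sha1:" + base64.b32encode(sha1.digest()).decode("ascii")


# Usage with an in-memory record body standing in for a real WARC record.
body = io.BytesIO(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>")
status_line, headers = parse_http_headers(body)
digest = payload_digest(body)
```

Reading in fixed-size chunks keeps memory use flat regardless of payload size. The header parsing here is deliberately strict and would need loosening to cope with malformed records.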
I have some changes that implement parsing of HTTP records and calculating the sha1 while streaming the payload. However, this happens internally in the library, and these changes are not suitable for upstream. https://bitbucket.org/rajbot/warc-tools
The warc library at https://github.com/internetarchive/warc has a number of these features.
It's GPLed. This is MIT licensed.
Edit: For the record, the other major difference is that this library has had to handle more corrupt warcfiles, or weirder variants.
(and the http library handles far too much weirdness)
That said, the interface to warc is /far/ nicer.
Any improvements we can make so that large and gargantuan warc files can be read and processed speedily would be welcome.