Open sebastian-nagel opened 4 years ago
Having something like a decode() or bodyDecoded() convenience method on both HttpMessage and WarcPayload that decodes the content encoding seems reasonable to me.
record.payload().decode() -> MessageBody?
response.http().decode() -> MessageBody?
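To make the idea concrete, here is a minimal sketch of what such a decode step could do internally. It is not jwarc code: the class ContentDecodingSketch and its decode(InputStream, String) signature are made up for illustration, and it only covers the encodings the JDK can handle on its own (gzip and zlib-wrapped deflate).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration only; not part of jwarc.
final class ContentDecodingSketch {

    /**
     * Wraps the raw body stream in a decoding stream chosen from the
     * value of the Content-Encoding header (a single encoding, no lists).
     */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // no header: body is not encoded
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "":
            case "identity":
                return raw;
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumes zlib-wrapped deflate; servers that send raw
                // deflate data are not handled by this sketch.
                return new InflaterInputStream(raw);
            default:
                // Let the caller decide how to treat unknown encodings.
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }
}
```

Throwing on an unknown encoding is just one option; returning the raw stream unchanged would be another, and which one is right probably depends on the caller.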
I think we could make brotli an optional Maven dependency and use it only if it's present on the classpath.
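A rough sketch of the classpath check, assuming the decoder comes from the org.brotli:dec artifact (class org.brotli.dec.BrotliInputStream); which brotli library to use is an open choice, and OptionalBrotliSketch is a made-up name. Reflection keeps the core free of a compile-time dependency.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of jwarc.
final class OptionalBrotliSketch {

    // Assumed decoder class (from the org.brotli:dec artifact).
    static final String BROTLI_CLASS = "org.brotli.dec.BrotliInputStream";

    /** Returns true if the named class can be found without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, OptionalBrotliSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Wraps raw in a brotli decoder if the optional library is available. */
    static InputStream brotli(InputStream raw) throws IOException {
        if (!isPresent(BROTLI_CLASS)) {
            throw new IOException("brotli decoder not on the classpath");
        }
        try {
            // Look up the (InputStream) constructor reflectively.
            return (InputStream) Class.forName(BROTLI_CLASS)
                    .getConstructor(InputStream.class)
                    .newInstance(raw);
        } catch (ReflectiveOperationException e) {
            throw new IOException("could not create brotli decoder", e);
        }
    }
}
```

On the Maven side this would pair with marking the dependency as optional so that downstream projects don't pull it in transitively.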
read the (decoded) payload into byte[] (or ByteBuffer)
Note that from Java 9 you can do body().stream().readAllBytes() and body().stream().readNBytes(buf, off, len). Not opposed to having our own as there's still quite a few people targeting 8 though.
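For those still on Java 8, a helper along these lines would cover the same ground. This is only a sketch: ReadFullySketch and readUpTo are made-up names, and the size cap is my own addition so that a huge payload can't exhaust memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of jwarc.
final class ReadFullySketch {

    /**
     * Reads at most maxBytes from the stream into a byte array using only
     * Java 8 APIs. Pass Integer.MAX_VALUE to read (almost) without a limit.
     */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never ask for more than the caller still allows.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

If a ByteBuffer is wanted instead, ByteBuffer.wrap(bytes) on the result avoids a second copy.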
Most consumers of the content payload require the payload to be
I've found myself writing similar code when consuming the payload body of WarcResponse records: jwarc's extract tool #41, a sitemap tester and StormCrawler. In order to make jwarc more usable, I'd propose to bundle the following functionality in two/few utility methods:
Content-Encoding
Content-Encoding isn't understood or is not reliable (gzip without gzip magic/header)
brotli (I assume that jwarc is designed to have zero dependencies)
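For the unreliable gzip case, one possible approach is to peek at the first two bytes and only gunzip when the gzip magic number (0x1f 0x8b) is actually there. A minimal sketch, with GzipSniffSketch and gunzipIfGzipped as made-up names rather than anything in jwarc:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only; not part of jwarc.
final class GzipSniffSketch {

    /**
     * Returns a gunzipping stream if the body starts with the gzip magic
     * bytes, otherwise returns the body unchanged (but buffered).
     */
    static InputStream gunzipIfGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the start of the body
        int b1 = in.read();
        int b2 = in.read();
        in.reset();            // rewind so no bytes are lost
        boolean gzipMagic = b1 == 0x1f && b2 == 0x8b;
        return gzipMagic ? new GZIPInputStream(in) : in;
    }
}
```

This ignores the Content-Encoding header entirely, so a body that is meant to be delivered as a gzip file would be decompressed too; a real implementation would probably only sniff when the header claims gzip.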