iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Utility methods to read payload body #48

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

Most consumers of the content payload require the payload to be

  1. decoded using the provided HTTP Content-Encoding
  2. available as byte[] (e.g. Tika) or even String (e.g. Jsoup)
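The two requirements above can be sketched with JDK classes only. This is a rough illustration, not jwarc code: the class `PayloadDecoding` and its method names are hypothetical, the caller is assumed to pass in the raw body stream and the value of the Content-Encoding header, and only `gzip`, `deflate` (zlib-wrapped) and `identity` are handled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for illustration; not part of the jwarc API. */
final class PayloadDecoding {
    private PayloadDecoding() {}

    /** Wraps the raw body in a decoder chosen by the Content-Encoding value. */
    public static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // no header: body is not encoded
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "":
            case "identity":
                return raw;
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumes the zlib-wrapped form; raw deflate streams would fail here.
                return new InflaterInputStream(raw);
            default:
                // Stacked encodings such as "gzip, br" also end up here in this sketch.
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }

    /** Requirement 2: the decoded payload as byte[] (works on Java 8). */
    public static byte[] decodeToBytes(InputStream raw, String contentEncoding) throws IOException {
        try (InputStream in = decode(raw, contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Requirement 2: the decoded payload as String; the charset is supplied by the caller. */
    public static String decodeToString(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        return new String(decodeToBytes(raw, contentEncoding), charset);
    }
}
```

Charset detection is deliberately left to the caller here, since it depends on the Content-Type header and possibly on the document itself.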

I've found myself writing similar code whenever I consume the payload body of WarcResponse records: in jwarc's extract tool (#41), in a sitemap tester, and in StormCrawler. To make jwarc more usable, I propose bundling the following functionality in two or a few utility methods:

ato commented 4 years ago

Having something like a decode() or bodyDecoded() convenience method on both HttpMessage and WarcPayload that decodes the content encoding seems reasonable to me.

record.payload().decode() -> MessageBody?
response.http().decode() -> MessageBody?

I think we could make brotli an optional Maven dependency and use it if it's present on the classpath.
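One way the classpath check could look, as a sketch only: the decoder is looked up reflectively so the code compiles and runs without the dependency. The class name `org.brotli.dec.BrotliInputStream` is an assumption about which brotli decoder would be chosen, and `OptionalBrotli` is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

/** Hypothetical helper for illustration; not part of the jwarc API. */
final class OptionalBrotli {
    // Assumed decoder class; any InputStream subclass with an (InputStream) constructor would do.
    private static final String DECODER_CLASS = "org.brotli.dec.BrotliInputStream";
    private static final Constructor<?> CONSTRUCTOR = lookup(DECODER_CLASS);

    private OptionalBrotli() {}

    /** Returns the (InputStream) constructor of the named class, or null if it cannot be loaded. */
    public static Constructor<?> lookup(String className) {
        try {
            return Class.forName(className).getConstructor(InputStream.class);
        } catch (ReflectiveOperationException | LinkageError e) {
            return null; // dependency not on the classpath, or not usable
        }
    }

    /** True if the optional decoder was found when this class was initialized. */
    public static boolean isAvailable() {
        return CONSTRUCTOR != null;
    }

    /** Wraps the stream in the optional decoder, or fails with an IOException if it is absent. */
    public static InputStream wrap(InputStream raw) throws IOException {
        if (CONSTRUCTOR == null) {
            throw new IOException("brotli decoder not found on the classpath");
        }
        try {
            return (InputStream) CONSTRUCTOR.newInstance(raw);
        } catch (InvocationTargetException e) {
            // The decoder's constructor itself failed; surface an IOException unchanged.
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException(cause);
        } catch (ReflectiveOperationException e) {
            throw new IOException(e);
        }
    }
}
```

A caller would check `OptionalBrotli.isAvailable()` before advertising or accepting `br`, and fall back to leaving the body undecoded otherwise.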

read the (decoded) payload into byte[] (or ByteBuffer)

Note that from Java 9 you can do body().stream().readAllBytes() and body().stream().readNBytes(buf, off, len). I'm not opposed to having our own, though, as there are still quite a few people targeting Java 8.
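For Java 8 targets, a stand-in for the three-argument `readNBytes` could look roughly like the following. The class name `StreamBytes` is hypothetical, and the method is meant to mirror the Java 9 behaviour of reading until `len` bytes are filled or the stream ends.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical Java 8 helper for illustration; not part of the jwarc API. */
final class StreamBytes {
    private StreamBytes() {}

    /**
     * Reads up to len bytes into buf starting at off, looping until either
     * len bytes were read or end of stream is reached.
     * Returns the number of bytes actually read (possibly 0).
     */
    public static int readNBytes(InputStream in, byte[] buf, int off, int len) throws IOException {
        if (off < 0 || len < 0 || len > buf.length - off) {
            throw new IndexOutOfBoundsException("off=" + off + ", len=" + len);
        }
        int total = 0;
        while (total < len) {
            // A single read() may return fewer bytes than requested, hence the loop.
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }
}
```

A bounded read like this is also handy for capping how much of a payload is pulled into memory, for example when only a prefix is needed for content-type sniffing.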