[ ] Name fields so the common ones are consistent with the CDXJ specification, i.e. make this an extension of CDXJ.
[ ] Document usage, noting that dynamic content can't be easily extracted because it's dynamic.
[ ] Store `content_first_bytes` without spaces? Store `content_ffb` as well or instead of `content_first_bytes`?
[ ] Consider allowing payload inclusion if small, e.g. smaller HTML files or initial binary chunk.
[ ] Consider extending the API so consumers can use the reference (name/offset) to get the payload InputStream.
[ ] Include `warcinfo` records in the JSONL output (currently skipped by the `windex.extract`).
[ ] Should `boiler_pipe` extraction be used?
[ ] Should extracted links be normalised?
[ ] Should image and/or PDF analysis be enabled?
[ ] Should the original payload be included if small enough? Or just for text?
[ ] Should there be an option to only output the term frequency or collocation statistics of the text, so we can do this for everything? Perhaps that's better as a post-processing step?
[ ] Both the Tika configuration `extract_all_metadata` and the experimental WARCStats code show there are lots of other metadata fields that might be of interest. These could be stored in some kind of hash, but note that Parquet/Avro schema reflection does not support hashes directly. The `MementoRecord` class illustrates that the `Memento` bean could be implemented on top of an extensible hash-map, which might make dynamic Parquet schema generation possible.
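The hash-map-backed record idea in the last item could be sketched as follows. This is a minimal, stdlib-only illustration of deriving a record schema from a map at runtime; the class name, field names (`url`, `content_length`, `is_revisit`), and the `schemaFor` helper are all hypothetical, and a real implementation would use Avro's `SchemaBuilder` rather than hand-built JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: a Memento-like record backed by an extensible map, from which
// an Avro-style record schema can be derived at runtime.
public class DynamicSchemaSketch {

    // Map a Java value to a (simplified) Avro primitive type name.
    static String avroType(Object value) {
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Boolean) return "boolean";
        return "string";
    }

    // Build a minimal Avro-style record-schema JSON from the map's entries.
    static String schemaFor(String name, Map<String, Object> record) {
        StringJoiner fields = new StringJoiner(",");
        for (Map.Entry<String, Object> e : record.entrySet()) {
            fields.add("{\"name\":\"" + e.getKey()
                    + "\",\"type\":\"" + avroType(e.getValue()) + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + name
                + "\",\"fields\":[" + fields + "]}";
    }

    public static void main(String[] args) {
        Map<String, Object> memento = new LinkedHashMap<>();
        memento.put("url", "http://example.org/");
        memento.put("content_length", 1234L);
        memento.put("is_revisit", false);
        System.out.println(schemaFor("Memento", memento));
    }
}
```

Because the record is just a map, new metadata fields extracted by Tika would not require schema changes in the bean, only in the generated schema.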
Following #299
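The term-frequency option above could be prototyped as a post-processing step over the extracted text. A minimal sketch, with a deliberately naive tokeniser (lowercase, split on non-letters); the class name is hypothetical and a real pipeline would reuse the indexer's own analyser:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: term-frequency statistics computed from extracted text,
// as a post-processing step rather than part of extraction itself.
public class TermFrequencySketch {

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Naive tokenisation: lowercase, split on any non-letter run.
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The quick brown fox jumps over the lazy dog"));
        // → {the=2, quick=1, brown=1, fox=1, jumps=1, over=1, lazy=1, dog=1}
    }
}
```

Emitting only these statistics (rather than the full text) would also sidestep some of the payload-size questions above.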
Example WARC Stats code output: