ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Improve the JSONL output #307

Open anjackson opened 1 year ago

anjackson commented 1 year ago

Following on from #299.

Example WARC Stats code output:

INFO  WARCStatsToolIntegrationTest - {"timestamp":"20080430204830","url":"http:\/\/www.archive.org\/services\/collection-rss.php","source-file":"hdfs:\/\/localhost:58536\/user\/anj\/inputs\/IAH-20080430204825-00000-blackbook-truncated.warc.gz","content-type":"application\/http; msgtype=response","content-length":"50832","length":"50831","source-offset":"18283","HEADER-reader-identifier":"IAH-20080430204825-00000-blackbook-truncated.warc.gz","HEADER-WARC-Payload-Digest":"sha1:JXXJNHJX4GEM44C4NOM3RJWKMKVBIGHF","HEADER-WARC-IP-Address":"207.241.229.39","HEADER-absolute-offset":"18283","HEADER-WARC-Target-URI":"http:\/\/www.archive.org\/services\/collection-rss.php","HEADER-WARC-Date":"2008-04-30T20:48:30Z","HEADER-Content-Length":"50832","HEADER-WARC-Record-ID":"<urn:uuid:8399ab93-1fee-4787-aa60-0f1ce83cb885>","HEADER-WARC-Type":"response","HEADER-Content-Type":"application\/http; msgtype=response","record-type":"warc.response","digest":"sha1:JXXJNHJX4GEM44C4NOM3RJWKMKVBIGHF","status-code":"200","HTTP-Date":"Wed, 30 Apr 2008 20:48:29 GMT","HTTP-Server":"Apache\/2.0.54 (Ubuntu) PHP\/5.0.5-2ubuntu1.4 mod_ssl\/2.0.54 OpenSSL\/0.9.7g","HTTP-X-Powered-By":"PHP\/5.0.5-2ubuntu1.4","HTTP-Connection":"close","HTTP-Content-Type":"text\/xml","host":"www.archive.org","year":"2008"}
INFO  WARCStatsToolIntegrationTest - {"timestamp":"20080430204825","url":"http:\/\/www.archive.org\/robots.txt","source-file":"hdfs:\/\/localhost:58536\/user\/anj\/inputs\/IAH-20080430204825-00000-blackbook-truncated.warc.gz","content-type":"application\/http; msgtype=response","content-length":"782","length":"781","source-offset":"707","HEADER-reader-identifier":"IAH-20080430204825-00000-blackbook-truncated.warc.gz","HEADER-WARC-Payload-Digest":"sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U","HEADER-WARC-IP-Address":"207.241.229.39","HEADER-absolute-offset":"707","HEADER-WARC-Target-URI":"http:\/\/www.archive.org\/robots.txt","HEADER-WARC-Date":"2008-04-30T20:48:25Z","HEADER-Content-Length":"782","HEADER-WARC-Record-ID":"<urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>","HEADER-WARC-Type":"response","HEADER-Content-Type":"application\/http; msgtype=response","record-type":"warc.response","digest":"sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U","status-code":"200","HTTP-Date":"Wed, 30 Apr 2008 20:48:24 GMT","HTTP-Server":"Apache\/2.0.54 (Ubuntu) PHP\/5.0.5-2ubuntu1.4 mod_ssl\/2.0.54 OpenSSL\/0.9.7g","HTTP-Last-Modified":"Sat, 02 Feb 2008 19:40:44 GMT","HTTP-ETag":"\"47c3-1d3-11134700\"","HTTP-Accept-Ranges":"bytes","HTTP-Content-Length":"467","HTTP-Connection":"close","HTTP-Content-Type":"text\/plain; charset=UTF-8","host":"www.archive.org","year":"2008"}
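To make the shape of these records easier to inspect, here is a minimal Python sketch that reads one logged line, decodes the JSON, and reports two things visible in the sample above: numeric values are emitted as strings, and several fields repeat the same value (for example `content-length` and `HEADER-Content-Length`). The helper names (`parse_stats_line`, `coerce_ints`, `duplicated_values`) and the `INT_FIELDS` list are hypothetical and are not part of webarchive-discovery; the sample record is a shortened copy of the second line above.

```python
import json

# Shortened copy of the second example record above (fewer fields, same values).
SAMPLE = r'INFO  WARCStatsToolIntegrationTest - {"timestamp":"20080430204825","url":"http:\/\/www.archive.org\/robots.txt","content-length":"782","length":"781","source-offset":"707","HEADER-absolute-offset":"707","HEADER-Content-Length":"782","status-code":"200","year":"2008"}'

# Assumption: these fields always hold integers encoded as strings,
# as they do in the two sample records.
INT_FIELDS = ("content-length", "length", "source-offset", "status-code", "year")


def parse_stats_line(line):
    """Drop any logger prefix before the first '{' and decode the JSON record."""
    start = line.find("{")
    if start == -1:
        raise ValueError("no JSON object found in line")
    return json.loads(line[start:])


def coerce_ints(record, fields=INT_FIELDS):
    """Return a copy of the record with the listed digit-only fields as ints."""
    out = dict(record)
    for name in fields:
        value = out.get(name)
        if isinstance(value, str) and value.isdigit():
            out[name] = int(value)
    return out


def duplicated_values(record):
    """Map each value that appears under more than one field to those field names."""
    by_value = {}
    for name, value in record.items():
        # Non-string values are serialised so they can be used as dict keys.
        key = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
        by_value.setdefault(key, []).append(name)
    return {key: names for key, names in by_value.items() if len(names) > 1}


if __name__ == "__main__":
    record = parse_stats_line(SAMPLE)
    print(coerce_ints(record))
    print(duplicated_values(record))
```

Note that `json.loads` already turns the escaped `\/` sequences back into plain `/`, so the escaping in the output is valid JSON and only affects readability. The timestamp is deliberately left as a string, since it is a 14-digit date rather than a quantity.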