A larger buffer size (8 -> 64 KiB) seems to speed up gzip decompression; it could be worth trying this for jwarc as well
webarchive-commons takes about 43% less time in the runs below (roughly 24 s instead of 43 s) if no digest is calculated while reading the WARC file
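The buffer-size observation can be tried in isolation with plain `java.util.zip`. The following is only a minimal sketch, not the code of the benchmarking tool: the class name `GzipBufferSketch`, the helper methods, and the in-memory placeholder payload are all made up for illustration, and a real measurement would read the WARC file from disk and repeat the runs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: compares two GZIPInputStream buffer sizes on in-memory data.
public class GzipBufferSketch {

    /** Decompresses the stream and returns the number of uncompressed bytes. */
    static long countBytes(InputStream compressed, int bufferSize) throws IOException {
        long total = 0;
        byte[] chunk = new byte[8192];
        // The second constructor argument sets the size of the internal input buffer.
        try (GZIPInputStream in = new GZIPInputStream(compressed, bufferSize)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Gzips a byte array in memory (placeholder for a real .warc.gz file). */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip(new byte[16 << 20]); // 16 MiB of zeros, assumption for the demo
        for (int size : new int[] {8 * 1024, 64 * 1024}) {
            long start = System.nanoTime();
            long n = countBytes(new ByteArrayInputStream(compressed), size);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("buffer " + (size / 1024) + " KiB: " + n + " bytes in " + ms + "ms");
        }
    }
}
```

Timings from such a toy input say little about a real crawl archive; the point is only where the buffer size is passed.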
Output:
Benchmarking CC-MAIN-20191207160050-20191207184050-00031.warc.gz
iteration 1
gzipinputstream (buffer 8kB) in 12253ms
gzipinputstream (buffer 64kB) in 10904ms
webarchive-commons 133945 in 40756ms
webarchive-commons (no digest check) 133945 in 24176ms
jwat buff 133945 in 21584ms
jwarc 133945 in 14623ms
iteration 2
gzipinputstream (buffer 8kB) in 12583ms
gzipinputstream (buffer 64kB) in 11482ms
webarchive-commons 133945 in 43104ms
webarchive-commons (no digest check) 133945 in 23460ms
jwat buff 133945 in 20800ms
jwarc 133945 in 14962ms
iteration 3
gzipinputstream (buffer 8kB) in 12953ms
gzipinputstream (buffer 64kB) in 12496ms
webarchive-commons 133945 in 44103ms
webarchive-commons (no digest check) 133945 in 24895ms
jwat buff 133945 in 19573ms
jwarc 133945 in 13978ms
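To get a feeling for how much a digest calculation can add to a plain read, the two webarchive-commons rows can be mimicked with the JDK alone. This is a rough sketch under assumptions: it uses `java.security.DigestInputStream` with SHA-1 on an in-memory placeholder payload, which is not necessarily how webarchive-commons computes or verifies its digests, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: reads the same data with and without a SHA-1 digest.
public class DigestOverheadSketch {

    /** Reads the stream to the end, discarding the data; returns the byte count. */
    static long drain(InputStream in) throws IOException {
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    /** Like drain(), but every byte read also updates a SHA-1 digest, which is returned. */
    static byte[] drainWithDigest(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (DigestInputStream digestIn = new DigestInputStream(in, sha1)) {
            drain(digestIn);
        }
        return sha1.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[16 << 20]; // 16 MiB of zeros, assumption for the demo

        long start = System.nanoTime();
        drain(new ByteArrayInputStream(payload));
        System.out.println("no digest in " + (System.nanoTime() - start) / 1_000_000 + "ms");

        start = System.nanoTime();
        drainWithDigest(new ByteArrayInputStream(payload));
        System.out.println("SHA-1 digest in " + (System.nanoTime() - start) / 1_000_000 + "ms");
    }
}
```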
Just a couple of updates to the comparison/benchmarking tool, also as a basis for discussing possible further performance improvements.
Briefly, about the results (on a gzipped WARC file):
Output:
Profile using async-profiler (interactive SVG bench.2020-01-17-17-57.async-prof.svg.gz):