internetarchive / warcprox

WARC writing MITM HTTP/S proxy

concurrency bug when running with multiple warc writer threads #101

Open nlevitt opened 6 years ago

nlevitt commented 6 years ago

In July @vbanos reported invalid gzip data in a WARC written by warcprox with --writer-threads=5.

My benchmarking suggests that 1 writer thread is optimal (see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads ). At IA we are running warcprox with 1 writer thread everywhere.

It would be nice to find and fix this bug, but the pragmatic thing might be to remove support for multiple writer threads.
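
For anyone who wants to check their own WARCs for the invalid gzip data reported above, here is a minimal sketch. It is not part of warcprox, and `count_gzip_members` is a hypothetical helper name. It assumes the file is a series of concatenated gzip members, the usual layout for .warc.gz, and it only validates the gzip layer, not the WARC records inside.

```python
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members in data (hypothetical helper, not a warcprox API).

    Raises zlib.error if a member is corrupt or truncated, or if
    non-gzip bytes follow a member.
    """
    count = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(31)
        # Raises zlib.error on a bad header, bad deflate data, or CRC mismatch.
        decompressor.decompress(data)
        if not decompressor.eof:
            raise zlib.error("truncated gzip member")
        # Whatever follows the end of this member is the start of the next.
        data = decompressor.unused_data
        count += 1
    return count
```

Usage would be something like `count_gzip_members(open(path, "rb").read())`, where `path` is your .warc.gz. This reads the whole file into memory, so it is only suitable for spot-checking files of moderate size.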

anjackson commented 6 years ago

It looks like the default is one WARC writer thread. Has that always been the case?

nlevitt commented 6 years ago

For a short time that was not the case, from approximately fd811905170 (Feb 5) to a1930495af3 (Apr 12).