vbanos closed this 5 years ago
I ran a benchmark with 100 target sites and 5 browser instances, comparing the performance of master
and this branch.
Performance on master was around 30-40 URLs per sec. E.g. one run:
"rates_5min": {
    "urls_per_sec": 38.278347561187076,
    "warc_bytes_per_sec": 230532.33322409115,
    "actual_elapsed": 241.5634467601776
},
Performance on this branch was around 40-45 URLs per sec. E.g.:
"rates_5min": {
    "warc_bytes_per_sec": 251107.9576452377,
    "urls_per_sec": 44.06984939210493,
    "actual_elapsed": 233.83333826065063
},
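To put the two sample runs side by side, here is a small sketch that computes the relative change between them. The numbers are copied from the two `rates_5min` blocks above; `relative_change` is a hypothetical helper, and since each side is a single run, the result says nothing about run-to-run variance.

```python
# Figures copied from the two "rates_5min" blocks above (one run each).
master = {
    "urls_per_sec": 38.278347561187076,
    "warc_bytes_per_sec": 230532.33322409115,
}
branch = {
    "urls_per_sec": 44.06984939210493,
    "warc_bytes_per_sec": 251107.9576452377,
}


def relative_change(before, after):
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


# Compare the same metric across the two runs.
changes = {key: relative_change(master[key], branch[key]) for key in master}

for key, change in changes.items():
    print(f"{key}: {change:+.1%}")
```

For these two runs that works out to roughly 15% more URLs per sec and roughly 9% more WARC bytes per sec.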
the docs lead me to believe your PR doesn’t do what it’s supposed to
* Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying device's
"block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
sounds like it only uses io.DEFAULT_BUFFER_SIZE when it’s unable to determine the device block size
also io.DEFAULT_BUFFER_SIZE is a universal option, it would take some auditing to determine what else it might be affecting
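A rough way to see which of the two values is in play on a given machine is to print both. This is only a sketch of the heuristic as the quoted docs describe it, not the interpreter's actual code, and the details vary between Python versions; `guess_default_buffer_size` is a hypothetical helper.

```python
import io
import os
import tempfile


def guess_default_buffer_size(path):
    """Approximate the documented heuristic for binary files.

    Prefer the device block size reported by stat, and fall back on
    io.DEFAULT_BUFFER_SIZE when no usable block size is available.
    """
    # st_blksize is not available on every platform (e.g. Windows), hence getattr.
    block_size = getattr(os.stat(path), "st_blksize", 0)
    return block_size if block_size > 1 else io.DEFAULT_BUFFER_SIZE


with tempfile.NamedTemporaryFile() as tmp:
    print("block-size based guess:", guess_default_buffer_size(tmp.name))
    print("io.DEFAULT_BUFFER_SIZE:", io.DEFAULT_BUFFER_SIZE)
```

If the first number comes from `st_blksize`, then by the docs' description changing `io.DEFAULT_BUFFER_SIZE` would not be what determines the buffer for that file.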
if anything you probably want to use the `buffering` argument to `open` here: https://github.com/internetarchive/warcprox/blob/740a80bfdb6/warcprox/writer.py#L103
all of this calls your benchmark into question of course
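A minimal sketch of the `buffering` suggestion, assuming the file is opened in binary append mode (the actual call in `writer.py` may differ). `bytes_on_disk_after_write` is a hypothetical helper that writes 64 KiB and checks how much has reached the file before it is closed, once with a small buffer and once with a 1 MiB buffer.

```python
import os
import tempfile

ONE_MIB = 1024 * 1024


def bytes_on_disk_after_write(buffering, payload_size=64 * 1024):
    """Write payload_size bytes with the given buffering and return the
    file size as seen on disk before the file is closed."""
    fd, path = tempfile.mkstemp(suffix=".warc")
    os.close(fd)
    try:
        # buffering > 1 sets the size of the buffer for this one file only.
        with open(path, "ab", buffering=buffering) as f:
            f.write(b"x" * payload_size)
            # No explicit flush() here, so this shows what the buffer let through.
            return os.path.getsize(path)
    finally:
        os.remove(path)


print("4 KiB buffer:", bytes_on_disk_after_write(4096))
print("1 MiB buffer:", bytes_on_disk_after_write(ONE_MIB))
```

On CPython the 64 KiB write goes straight to the file with the 4 KiB buffer and stays in memory with the 1 MiB buffer. Unlike changing `io.DEFAULT_BUFFER_SIZE`, this affects only the file being opened.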
When we open a WARC file for writing with the standard `open(filename)`, we use a `BufferedWriter` to improve performance. This is the default Python behavior. `BufferedWriter` uses `io.DEFAULT_BUFFER_SIZE=8192` by default unless we set a custom value.

We set `io.DEFAULT_BUFFER_SIZE=1048576` (1 MB) to speed up file writing. The buffer will be written out to the underlying `RawIOBase` object under various conditions, described in the docs: https://docs.python.org/3/library/io.html#io.BufferedWriter
We remove the `flush()` call in the WARC writer.
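To illustrate what dropping `flush()` means in practice, here is a small sketch that builds a `BufferedWriter` by hand with a deliberately tiny buffer and records the on-disk file size at each step. `demo_flush_points` is a hypothetical helper, not warcprox code, and the 1024-byte buffer is chosen only to keep the example small.

```python
import io
import os
import tempfile


def demo_flush_points(buffer_size=1024):
    """Record the on-disk size of a file after each buffered operation."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    sizes = {}
    try:
        with io.BufferedWriter(io.FileIO(path, "ab"), buffer_size=buffer_size) as writer:
            writer.write(b"a" * 10)            # fits in the buffer
            sizes["after_small_write"] = os.path.getsize(path)

            writer.flush()                     # explicit flush pushes it out
            sizes["after_flush"] = os.path.getsize(path)

            writer.write(b"b" * 10)            # buffered again, no flush this time
            sizes["after_second_small_write"] = os.path.getsize(path)

            writer.write(b"c" * (buffer_size * 2))  # too big for the buffer
            sizes["after_overflow"] = os.path.getsize(path)
        # Leaving the with block closes the writer, which writes out the rest.
        sizes["after_close"] = os.path.getsize(path)
        return sizes
    finally:
        os.remove(path)


for step, size in demo_flush_points().items():
    print(step, size)
```

Without the explicit `flush()`, data written since the last write-out stays in memory until the buffer overflows or the file is closed, so a larger buffer means fewer, larger writes but more unwritten data at any moment.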