internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Increase IO buffer size to improve WarcWriter performance #132

Closed vbanos closed 5 years ago

vbanos commented 5 years ago

When we open a WARC file for writing with the built-in open(), Python wraps the file in a BufferedWriter to improve performance. This is the default Python behavior for files opened in binary mode.

BufferedWriter uses io.DEFAULT_BUFFER_SIZE=8192 by default unless we set a custom value.
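
As a quick check of the two statements above, the following sketch opens a scratch file in binary write mode and inspects what open() returns. The temporary path is a placeholder for illustration, not a path warcprox uses, and the value of io.DEFAULT_BUFFER_SIZE depends on the Python version, so it is printed rather than assumed.

```python
import io
import os
import tempfile

# Scratch directory and file name are hypothetical, for illustration only.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.warc")
    with open(path, "wb") as f:
        # Binary write mode with default buffering returns a BufferedWriter
        # wrapped around the raw FileIO object.
        writer_type = type(f).__name__
        is_buffered = isinstance(f, io.BufferedWriter)
        raw_type = type(f.raw).__name__

print(writer_type, "wrapping", raw_type)
print("io.DEFAULT_BUFFER_SIZE =", io.DEFAULT_BUFFER_SIZE)
```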

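We set io.DEFAULT_BUFFER_SIZE=1048576 (1 MiB) to speed up file writing. According to the docs, the buffer will be written out to the underlying RawIOBase object under various conditions, including when the buffer gets too small for all pending data, when flush() is called, when a seek() is requested (for BufferedRandom objects), and when the BufferedWriter object is closed or destroyed:

https://docs.python.org/3/library/io.html#io.BufferedWriter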
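
To make the flush conditions concrete, here is a minimal sketch that watches the on-disk file size while writing through a large buffer. It passes the buffer size explicitly via the buffering argument so the behavior does not depend on how the default is chosen; the file name and byte counts are made up for the example.

```python
import os
import tempfile

ONE_MIB = 1024 * 1024

# Scratch directory and file name are hypothetical, for illustration only.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.warc")
    f = open(path, "wb", buffering=ONE_MIB)

    # 100 bytes fit easily in the 1 MiB buffer, so nothing reaches the file yet.
    f.write(b"x" * 100)
    size_before_flush = os.path.getsize(path)

    # An explicit flush() pushes the pending bytes to the raw file object.
    f.flush()
    size_after_flush = os.path.getsize(path)

    # Closing the writer also writes out whatever is still buffered.
    f.write(b"y" * 50)
    f.close()
    size_after_close = os.path.getsize(path)

print(size_before_flush, size_after_flush, size_after_close)
```

One consequence of dropping flush() is that up to a buffer's worth of data can sit in process memory, invisible to readers of the file, until the buffer fills or the file is closed.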

We also remove the explicit flush() call in the WARC writer.

vbanos commented 5 years ago

I performed a benchmark with 100 target sites and 5 browser instances comparing master and this branch performance.

master performance was around 30-40 URLs per sec. E.g. one run:

  "rates_5min": {
    "urls_per_sec": 38.278347561187076,
    "warc_bytes_per_sec": 230532.33322409115,
    "actual_elapsed": 241.5634467601776
  },

This branch performance was around 40-45 URLs per sec. E.g.:

  "rates_5min": {
    "warc_bytes_per_sec": 251107.9576452377,
    "urls_per_sec": 44.06984939210493,
    "actual_elapsed": 233.83333826065063
  },
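
For a rough sense of scale, the relative change between the two example runs can be computed directly from the figures quoted above. This is only arithmetic on one run per branch; it says nothing about run-to-run variance. The helper name relative_change is introduced here for illustration.

```python
# Figures copied from the two "rates_5min" blocks above.
master = {
    "urls_per_sec": 38.278347561187076,
    "warc_bytes_per_sec": 230532.33322409115,
}
branch = {
    "urls_per_sec": 44.06984939210493,
    "warc_bytes_per_sec": 251107.9576452377,
}


def relative_change(before, after):
    """Fractional change from before to after (0.10 means +10%)."""
    return (after - before) / before


changes = {key: relative_change(master[key], branch[key]) for key in master}

for key, value in changes.items():
    print(f"{key}: {value:+.1%}")
```

For these two runs that works out to roughly 15% more URLs per second and roughly 9% more WARC bytes per second.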

nlevitt commented 5 years ago

the docs lead me to believe your PR doesn’t do what it’s supposed to

    * Binary files are buffered in fixed-size chunks; the size of the buffer
      is chosen using a heuristic trying to determine the underlying device's
      "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
      On many systems, the buffer will typically be 4096 or 8192 bytes long.

sounds like it only uses io.DEFAULT_BUFFER_SIZE when it’s unable to determine the device block size

also io.DEFAULT_BUFFER_SIZE is a universal option, it would take some auditing to determine what else it might be affecting, if anything

you probably want to use the buffering argument to open here: https://github.com/internetarchive/warcprox/blob/740a80bfdb6/warcprox/writer.py#L103

all of this calls your benchmark into question of course
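
A minimal sketch of the suggestion above, assuming the writer opens its file in "wb" mode (not verified against the linked line). The helper open_warc_for_writing, its buffer_size parameter, and the scratch file name are hypothetical and not part of warcprox. Passing buffering explicitly affects only this one file object and leaves io.DEFAULT_BUFFER_SIZE untouched.

```python
import os
import tempfile

ONE_MIB = 1024 * 1024


def open_warc_for_writing(path, buffer_size=ONE_MIB):
    """Hypothetical helper: open path for binary writing with an explicit buffer size."""
    # The "wb" mode is an assumption for this sketch, not taken from warcprox.
    return open(path, "wb", buffering=buffer_size)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.warc")
    f = open_warc_for_writing(path)

    # The block size hint that the default heuristic would consult;
    # st_blksize is not available on every platform, hence getattr.
    block_size_hint = getattr(os.stat(path), "st_blksize", None)

    # 64 KiB is larger than typical default buffers but smaller than 1 MiB,
    # so with the explicit buffer it stays in memory until close().
    f.write(b"x" * (64 * 1024))
    pending_size_on_disk = os.path.getsize(path)

    f.close()
    final_size_on_disk = os.path.getsize(path)

print("st_blksize hint:", block_size_hint)
print(pending_size_on_disk, final_size_on_disk)
```

Checking the on-disk size before and after close() is one way to confirm that the requested buffer size is actually in effect, which the benchmark alone cannot show.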