Profiling with criterion and 'cargo flamegraph' showed that we were CPU-bound due to excessive thrashing in the BTreeMap readahead 'cache'. The map was full of very tiny entries: we created one for every single read during random-access mode, and then potentially more tiny entries by splitting cache cells later. All this thrash happened while the 'state' mutex was held, effectively reducing ripunzip to single-threaded performance.
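For context, here is a rough sketch of the kind of pattern that causes this thrash; the types and names are illustrative stand-ins, not ripunzip's actual internals:

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

// Illustrative only: offset -> cached bytes, one entry per read.
struct OldState {
    cache: BTreeMap<u64, Vec<u8>>,
}

fn old_read(state: &Mutex<OldState>, offset: u64, buf: &mut [u8]) {
    // The mutex is held for the whole operation, so other threads wanting
    // to read stall while we churn the map.
    let mut state = state.lock().unwrap();
    // ...fetch buf.len() bytes from the underlying file (elided)...
    // A tiny read inserts a tiny entry; later splits add even more.
    state.cache.insert(offset, buf.to_vec());
}

fn main() {
    let state = Mutex::new(OldState { cache: BTreeMap::new() });
    let mut buf = [0u8; 4]; // a 4-byte read becomes a 4-byte cache entry
    old_read(&state, 1234, &mut buf);
}
```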
With this change, we:
* always read into the cache, instead of sometimes reading directly into the requested read buffer, and then service the requested read from the cache. This involves more copies, but it simplifies the logic and means every item inserted into the cache is large.
* never split cache blocks. Instead, we keep a count of how much of each block has been consumed, and discard a block once it has all been read (see the sketch after this list).
* empirically adjust the block size for best performance.
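A minimal sketch of the resulting scheme, assuming the cache maps a block's starting file offset to its data plus a consumed-bytes counter; the names, the block-alignment logic and the fixed `BLOCK_SIZE` are illustrative rather than the actual ripunzip types:

```rust
use std::collections::BTreeMap;

// Hypothetical block size; the real value was chosen by benchmarking.
const BLOCK_SIZE: usize = 128 * 1024;

struct CacheBlock {
    data: Vec<u8>,
    consumed: usize, // bytes of `data` already handed out to callers
}

struct Readahead {
    // Keyed by the file offset at which each block starts.
    cache: BTreeMap<u64, CacheBlock>,
}

impl Readahead {
    /// Always read a whole block into the cache, then service the caller's
    /// (possibly tiny) read by copying out of it. Blocks are never split;
    /// once a block has been fully consumed it is simply discarded.
    fn read(&mut self, offset: u64, buf: &mut [u8]) {
        let block_start = offset - (offset % BLOCK_SIZE as u64);
        let block = self.cache.entry(block_start).or_insert_with(|| CacheBlock {
            // A real implementation would fill this from the underlying file.
            data: vec![0u8; BLOCK_SIZE],
            consumed: 0,
        });
        let within = (offset - block_start) as usize;
        // Simplification: a real implementation would loop if the requested
        // range crossed a block boundary.
        let n = buf.len().min(block.data.len() - within);
        buf[..n].copy_from_slice(&block.data[within..within + n]);
        block.consumed += n;
        if block.consumed >= block.data.len() {
            // The whole block has been read: drop it rather than splitting it.
            self.cache.remove(&block_start);
        }
    }
}

fn main() {
    let mut readahead = Readahead { cache: BTreeMap::new() };
    let mut buf = [0u8; 16];
    // A 16-byte read no longer creates a 16-byte cache entry; it is served
    // from (and accounted against) one large block.
    readahead.read(1000, &mut buf);
}
```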
Together, these changes eliminate significant CPU usage from the BTreeMap.
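For the block-size tuning, a criterion benchmark along these lines is one way to compare candidate sizes; `unzip_with_block_size` is a hypothetical harness, not part of ripunzip's API:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Hypothetical stand-in for driving the extraction path with a given
// readahead block size against a fixture archive.
fn unzip_with_block_size(_block_size: usize) {
    // ...extract a test zip (elided)...
}

fn bench_block_sizes(c: &mut Criterion) {
    let mut group = c.benchmark_group("readahead_block_size");
    for &size in &[16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024] {
        group.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, &size| {
            b.iter(|| unzip_with_block_size(size));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_block_sizes);
criterion_main!(benches);
```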