Open vikasvb90 opened 11 months ago
cc: @Bukhtawar @gbbafna @sachinpkale @ashking94
[Storage Triage Meeting] Hi @vikasvb90, do we have any POC or numbers around the possible gains from this?
@gbbafna This was an observation based on the impact on throughput under different disk configurations. I don't have any numbers around the proposal. We can try something simpler: a synchronous version where we experiment with a higher number of IO threads and a bounded pool of CPU threads for data integrity and encryption. There would be some context-switching overhead with this approach though.
Is your feature request related to a problem? Please describe. BufferedInputStream was observed to provide a visible improvement in throughput, and we have seen a considerable difference in our numbers when we modify disk configurations. This also explains why the buffered stream helps: it reduces the number of disk read round trips by caching data and avoiding repeated disk reads on S3 retries.
Describe the solution you'd like To further optimise this, consider how we execute these ops today, in sequence: disk read -> checksum update -> encrypt -> submit for upload (we don't wait for the upload to succeed). This is the job of the stream_reader thread. Since our bottleneck is the disk read, we can reduce its impact by making it asynchronous.
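For reference, the sequential pipeline above can be sketched roughly as below. This is a minimal illustration, not OpenSearch code: the class and method names (`SequentialReader`, `processPart`, `submitForUpload`) are hypothetical, and the XOR "encryption" is a stand-in for the real cipher.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical sketch of the sequential stream_reader loop:
// disk read -> checksum update -> encrypt -> submit for upload.
public class SequentialReader {
    static long processPart(InputStream disk, int bufSize) throws IOException {
        CRC32 checksum = new CRC32();
        byte[] buf = new byte[bufSize];
        int n;
        while ((n = disk.read(buf)) != -1) {   // blocking disk read: the bottleneck
            checksum.update(buf, 0, n);        // CPU work serialized behind the read
            byte[] encrypted = encrypt(buf, n);
            submitForUpload(encrypted);        // fire-and-forget, as described above
        }
        return checksum.getValue();
    }

    // Stand-in for real encryption; production code would use a javax.crypto.Cipher.
    static byte[] encrypt(byte[] data, int len) {
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) out[i] = (byte) (data[i] ^ 0x5A);
        return out;
    }

    static void submitForUpload(byte[] part) { /* enqueue to the S3 uploader */ }
}
```

Every iteration waits for the disk before any CPU work starts, which is why throughput tracks disk latency so closely.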
This is a complicated change as it will require two things: a data supplier passed down to the buffered stream, and an overridden buffered stream that can eagerly fill its buffer. In addition, we will need a global rate limiter on the amount of data we can cache in memory in BufferedInputStream.
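One possible shape for the eager-fill stream plus the global memory limiter is sketched below, assuming a background IO thread that pre-reads chunks into a bounded queue and a shared `Semaphore` (permits = bytes) as the global cap. All names here (`EagerBufferedInputStream`, `memoryBudget`) are hypothetical, not a proposed API.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;

// A prefetching stream: an IO-pool task eagerly reads fixed-size chunks from the
// underlying (disk) stream into a bounded queue; a Semaphore shared across all
// such streams caps the total bytes cached in memory at once.
public class EagerBufferedInputStream extends FilterInputStream {
    private static final byte[] EOF = new byte[0];
    private final BlockingQueue<byte[]> chunks = new ArrayBlockingQueue<>(16);
    private final Semaphore memoryBudget;  // global limiter, shared across streams
    private byte[] current = new byte[0];
    private int pos = 0;

    public EagerBufferedInputStream(InputStream in, int chunkSize,
                                    Semaphore memoryBudget, ExecutorService ioPool) {
        super(in);
        this.memoryBudget = memoryBudget;
        ioPool.submit(() -> {
            try {
                byte[] buf = new byte[chunkSize];
                int n;
                while ((n = in.read(buf)) != -1) {
                    memoryBudget.acquire(n);   // block if the global cache budget is spent
                    byte[] copy = new byte[n];
                    System.arraycopy(buf, 0, copy, 0, n);
                    chunks.put(copy);
                }
                chunks.put(EOF);
            } catch (Exception e) {
                // a production version would propagate this to the reader
            }
        });
    }

    @Override
    public int read() throws IOException {
        if (pos == current.length) {
            if (current == EOF) return -1;
            try {
                current = chunks.take();              // waits on the prefetcher
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException(e);
            }
            memoryBudget.release(current.length);     // bytes handed to the consumer
            pos = 0;
            if (current == EOF) return -1;
        }
        return current[pos++] & 0xFF;
    }
}
```

The key property is that disk reads overlap with checksum/encrypt work on the consumer side, while the shared semaphore bounds aggregate memory use across concurrent uploads.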