opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.69k stars 1.79k forks source link

[Remote Store] Optimization in remote upload async flow #11263

Open vikasvb90 opened 11 months ago

vikasvb90 commented 11 months ago

Is your feature request related to a problem? Please describe. BufferedInputStream was seen providing visible improvement in throughput. We have seen considerable difference in our numbers when we modify disk configurations. This also explains why buffered stream is helping us which is because we are essentially reducing number of disk read round trips by caching data and avoiding disk reads on S3 retries.

Describe the solution you'd like Now, to further optimise this I was thinking of something. Currently we execute these ops in sequence - disk read -> checksum update -> encrypt -> submit for upload ( we don't wait for upload to be successful) . This is the job of stream_reader thread. Since, our bottleneck is disk read, we can reduce it by making it async.

  1. We can prefill data in async in buffered stream on every read so that wait time on disk read is removed.
  2. We can use AsynchronousFileChannel to make it async without spawning a new thread. It is straightforward in translog but need to check if lucene file abstractions have something similar.

This is a complicated change as it will require two things -> a data supplier to be passed down to buffered stream and an overridden buffered stream which can eager fill the buffer. In addition to this, we will also need a global rate limiter on the amount of data we can cache in memory in BufferedInputStream.

vikasvb90 commented 11 months ago

cc: @Bukhtawar @gbbafna @sachinpkale @ashking94

gbbafna commented 6 months ago

[Storage Triage Meeting] Hi @vikasvb90 , do we any POC/some numbers around the possible gains from this ?

vikasvb90 commented 6 months ago

@gbbafna This was an observation based on impact on throughput with different disk configurations. I don't have any numbers around the proposal. We can try something simpler - a synchronous version where we can experiment with higher IO threads and some bounded CPU threads for data integrity and encryption. There would be some overhead of context switch with this approach though.