opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] [Parallel downloads] Investigate the use of virtual threads as the new I/O mechanism for segment downloads #11708

Open kotwanikunal opened 8 months ago

kotwanikunal commented 8 months ago

Is your feature request related to a problem? Please describe

Coming in from #11461 - we need to investigate utilizing virtual threads as the default I/O mechanism for segment downloads.

Describe the solution you'd like

#10786 talks about using virtual threads as the I/O mechanism over the current code base.

As a part of parallel downloads -

A major benefit of the virtual thread mechanism is that the existing code constructs and APIs do not need to change - it will leverage the benefits at a logically lower level.

The aim of this issue is to test and compare the overall effort and performance benefits of the two mechanisms, and to document the suggested path forward.

Related component

Search:Remote Search

Describe alternatives you've considered

Additional context

kotwanikunal commented 6 months ago

Experiments performed

All of the experiments below were performed on a self-managed cluster with 3 r7g.4xl EC2 nodes, a load balancer, and a c5.2xl EC2 node running opensearch-benchmark as the load generator.

Recovery

| Approach | Shard Size (GB) | Recovery Time (seconds) | Throughput (MB/s) | Improvement |
|---|---|---|---|---|
| File-based Blocking Threads | 17.8 | 94 | 189.9 | Baseline for recovery |
| File-based Virtual Threads | N/A | N/A | N/A | |
| Part-based Blocking Threads | 17.8 | 85.5 | 208.23 | 10% over file-based |
| Part-based Virtual Threads | 17.8 | 70 | 255.17 | 34% over file-based; 22% over blocking multipart |

The virtual-threaded multipart approach is roughly 34% faster than the file-based approach and ~22% faster than the blocking multipart approach.
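As a quick sanity check, the improvement percentages can be recomputed from the reported throughput figures. The sketch below is only arithmetic on the numbers in the recovery results; the class and method names are made up for illustration. It yields roughly 9.7%, 34.4% and 22.5%, which agrees with the rounded values quoted above.

```java
// Recomputes the relative throughput gains from the reported MB/s values.
final class RecoveryImprovement {

    /** Relative gain of candidate over baseline throughput, in percent. */
    static double gainPercent(double baselineMBps, double candidateMBps) {
        return (candidateMBps / baselineMBps - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double fileBlocking = 189.9;   // file-based, blocking threads
        double partBlocking = 208.23;  // part-based, blocking threads
        double partVirtual = 255.17;   // part-based, virtual threads

        System.out.printf("part-based blocking vs file-based: %.1f%%%n",
                gainPercent(fileBlocking, partBlocking));
        System.out.printf("part-based virtual vs file-based: %.1f%%%n",
                gainPercent(fileBlocking, partVirtual));
        System.out.printf("part-based virtual vs part-based blocking: %.1f%%%n",
                gainPercent(partBlocking, partVirtual));
    }
}
```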

Replication

Case 1: 1s Refresh Interval for so

| Approach | Average Lag (seconds) | Average Lag (GB) | Max Lag (seconds) | Max Lag (GB) |
|---|---|---|---|---|
| Baseline | 1.9 | 0.27 | 42 | 4.6 |
| File-based Blocking Threads | 45 | 1.4 | 132 | 4.8 |
| File-based Virtual Threads | 55 | 1.7 | 150 | 5.8 |
| Part-based Blocking Threads | 15 | 0.9 | 108 | 5.2 |
| Part-based Virtual Threads | 16 | 0.5 | 44 | 1.4 |

Case 2: 10s Refresh Interval for so

| Approach | Average Lag (seconds) | Average Lag (GB) | Max Lag (seconds) | Max Lag (GB) |
|---|---|---|---|---|
| File-based Blocking Threads | 61.84873 | 2.0458 | 198 | 6.9 |
| File-based Virtual Threads | 71.4141 | 2.11445 | 246 | 6.3 |
| Part-based Blocking Threads | 51.16281 | 1.14736 | 168 | 3.1 |
| Part-based Virtual Threads | 41.63 | 1.44 | 126 | 4.8 |

Compared to the file-based approach, the virtual-threaded, part-based approach performs better. The inherent fairness added by the part-based approach helps reduce the overall replication lag and improve replication performance.

The baseline case is included for comparison. Of the approaches tested, the virtual-threaded multipart approach is the fairest.
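One plausible way to picture the fairness of the part-based approach: if each file is cut into bounded byte ranges, a large segment file becomes many small jobs that can interleave with jobs from other files instead of holding a thread for the whole file. The following is a minimal sketch of such a split. The class name `PartRanges`, the `{start, endExclusive}` representation and the fixed part size are assumptions for illustration and are not taken from the linked branches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a file of a given length into consecutive byte ranges.
final class PartRanges {

    /** Returns {start, endExclusive} pairs covering [0, fileLength), each at most partSize bytes long. */
    static List<long[]> split(long fileLength, long partSize) {
        if (fileLength < 0 || partSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and partSize must be > 0");
        }
        List<long[]> parts = new ArrayList<>();
        long start = 0;
        while (start < fileLength) {
            // The last part may be shorter than partSize.
            long end = start + Math.min(partSize, fileLength - start);
            parts.add(new long[] { start, end });
            start = end;
        }
        return parts;
    }
}
```

Each range could then be submitted as its own download job, so the scheduler sees many similarly sized units of work rather than a few very uneven ones.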

Security (JSM)

Path forward

The next steps for multipart download support would be as follows -

  1. Add support for platform-threaded multipart downloads as the default download mechanism for remote store operations
    1. Virtual thread support will be added for JDK 21+ runtimes and is described in step 2
    2. This will be wrapped behind a feature release flag that defaults to file-based downloads pre-launch
  2. Add multi-release JAR support for JDK 21+ and pre-JDK 21 runtimes
    1. This ensures that virtual threads are used on JDK 21+, where support exists
    2. This has been previously implemented here:
  3. For JSM, there are 2 possible approaches -
    1. Enable security manager with a passthrough just for virtual threads as highlighted here
    2. Enable Virtual Threads on JDK21+ only if security manager is disabled
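The second JSM option, combined with the opt-in flag and the JDK 21+ requirement, amounts to a three-way check before choosing an executor. Below is a minimal sketch under stated assumptions: the class `DownloadExecutors`, its method names and its parameters are hypothetical, and reflection stands in for the multi-release JAR mechanism so that the class also compiles on pre-21 JDKs. Only `Executors.newVirtualThreadPerTaskExecutor()` (final in JDK 21) and `Executors.newFixedThreadPool(int)` are real JDK calls.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical selector: virtual threads only when opted in, on JDK 21+, and without a security manager.
final class DownloadExecutors {

    /** True when the running JDK ships virtual threads as a final feature (JDK 21+). */
    static boolean virtualThreadsAvailable() {
        return Runtime.version().feature() >= 21;
    }

    /** Decision function: callers supply the flag value and whether a security manager is installed. */
    static boolean useVirtualThreads(boolean flagEnabled, boolean securityManagerPresent) {
        return flagEnabled && !securityManagerPresent && virtualThreadsAvailable();
    }

    static ExecutorService create(boolean flagEnabled, boolean securityManagerPresent, int platformThreads) {
        if (useVirtualThreads(flagEnabled, securityManagerPresent)) {
            try {
                // Reflective lookup keeps this source compilable on older JDKs.
                return (ExecutorService) Executors.class
                        .getMethod("newVirtualThreadPerTaskExecutor")
                        .invoke(null);
            } catch (ReflectiveOperationException e) {
                // Fall through to the platform-thread pool below.
            }
        }
        return Executors.newFixedThreadPool(platformThreads);
    }
}
```

With a multi-release JAR, the reflective call would presumably be replaced by a JDK 21-specific class under `META-INF/versions/21`, leaving the base class with only the platform-thread path.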

TLDR:

Code

With Security Manager: https://github.com/opensearch-project/OpenSearch/compare/main...kotwanikunal:OpenSearch:virtual-thread-sm

Without Security Manager: https://github.com/opensearch-project/OpenSearch/compare/main...kotwanikunal:OpenSearch:virtual-thread

cc: @Bukhtawar @andrross @mch2

reta commented 6 months ago

@kotwanikunal thanks a lot for exploring this area, there have been discussions related to virtual threads for a while now (both in OpenSearch and Apache Lucene communities), at the moment there are a few cautionary cases that we may look into more closely:

kotwanikunal commented 6 months ago

> @kotwanikunal thanks a lot for exploring this area, there have been discussions related to virtual threads for a while now (both in OpenSearch and Apache Lucene communities), at the moment there are a few cautionary cases that we may look into more closely:
>
>   • file system I/O (basically accessing files on disk) is not virtual thread friendly yet, at least on some operating systems, https://openjdk.org/jeps/444 warns about that explicitly

Nice to hear from you @reta! :) Given that we see gains even with the platform-threaded approach, and that virtual threads will improve over time, would it make more sense to head in this direction rather than toward a reactive/async programming model? Virtual threads will still be opt-in, possibly behind a feature flag, to allow some bake time.

kotwanikunal commented 6 months ago
>   • as a side note, we would very likely need to reconsider a large part of the core related to resource tracking, hot threads reporting, pool configurations, ... I am not 100% sure but I believe we may have surprises here.

For the approach we are looking at, the queue is not unbounded and we will limit job submissions at the recovery/replication event level. We can build guardrails and a framework around it with this set of operations.
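One common way to cap submissions regardless of how cheap the underlying threads are is a semaphore around the executor. The sketch below is an assumption about how such a guardrail could look, not the mechanism used in the linked branches; `BoundedSubmitter`, `maxInFlight` and the demo helper `peakInFlight` are hypothetical names, and the sleep is a stand-in for a part download.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical guardrail: at most maxInFlight jobs are handed to the executor at any time.
final class BoundedSubmitter {
    private final ExecutorService executor;
    private final Semaphore permits;

    BoundedSubmitter(ExecutorService executor, int maxInFlight) {
        this.executor = executor;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the caller until a permit is free, then hands the task to the executor. */
    void submit(Runnable task) throws InterruptedException {
        permits.acquire();
        try {
            executor.execute(() -> {
                try {
                    task.run();
                } finally {
                    permits.release(); // free the slot once the job finishes
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // the executor rejected the task, so give the permit back
            throw e;
        }
    }

    /** Demo helper: runs short jobs and returns the highest number observed running at once. */
    static int peakInFlight(int tasks, int maxInFlight) {
        ExecutorService pool = Executors.newCachedThreadPool();
        BoundedSubmitter submitter = new BoundedSubmitter(pool, maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                submitter.submit(() -> {
                    peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                    try {
                        Thread.sleep(5); // stand-in for downloading one part
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                    }
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("demo tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return peak.get();
    }
}
```

The same wrapper works unchanged whether the executor is backed by platform or virtual threads, which is why the limit is placed at submission time rather than in the pool size.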

reta commented 6 months ago

> Nice to hear from you @reta! :)

❤️ Same, @kotwanikunal !

> Given that we see gains even with the platform-threaded approach, and that virtual threads will improve over time, would it make more sense to head in this direction rather than toward a reactive/async programming model?

I think virtual threads are a perfect fit for downloading segments (network I/O); for the filesystem we may need to weigh the tradeoff. As far as I understand, the experimental implementation does not show where exactly the gains come from, only that there are some in general. If we could collect precise measurements here, it would help.

> For the approach we are looking at, the queue is not unbounded and we will limit job submissions at the recovery/replication event level. We can build guardrails and a framework around it with this set of operations.

My apologies, guardrails are certainly needed, but I meant a different subject here: we need visibility into virtual threads regardless of where they are being used. Can we track the CPU share used by a virtual thread (as we do for a regular one)? Can we capture stacks to collect hot threads? And things like that ...
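For context on these questions, the standard JDK hooks for per-thread CPU time and stack capture are `ThreadMXBean` and `Thread.getAllStackTraces()`. JEP 444 states that `ThreadMXBean` supports only platform threads and that `Thread.getAllStackTraces()` returns platform threads only, so the minimal sketch below illustrates what is visible today for regular threads rather than a solution for virtual ones. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical probe showing the platform-thread visibility APIs.
final class PlatformThreadVisibility {

    /** CPU time of the calling thread in nanoseconds, or -1 if measurement is unsupported or disabled. */
    static long currentThreadCpuNanos() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            return -1L;
        }
        return bean.getCurrentThreadCpuTime();
    }

    /** Number of live threads whose stacks Thread.getAllStackTraces() can capture. */
    static int visibleThreadCount() {
        return Thread.getAllStackTraces().size();
    }

    public static void main(String[] args) {
        System.out.println("CPU nanos (current thread): " + currentThreadCpuNanos());
        System.out.println("Threads with capturable stacks: " + visibleThreadCount());
    }
}
```

JEP 444 also describes a JSON thread dump (`jcmd <pid> Thread.dump_to_file -format=json <file>`) that includes virtual threads, which may be a starting point for stack capture.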