kotwanikunal opened this issue 8 months ago (status: Open)
All of the experiments below were performed on a self-managed cluster with 3 `r7g.4xl` EC2 nodes behind a load balancer, and a `c5.2xl` EC2 node running `opensearch-benchmark` as the load generator.
The benchmarks used the `nyc_taxis` index with 6 primary shards and 0 replicas initially, with a primary shard size of around 17 GB. Disk throughput was configured to 1000 MB/s and IOPS to 6000, the JVM heap was configured to 100 GB, and a concurrency of 100 was used to facilitate parallel recovery of shards. The recovery speed limit (`indices.recovery.max_bytes_per_sec`) was removed.

| Approach | Shard Size (GB) | Recovery Time (seconds) | Throughput (MB/s) | Improvement |
|---|---|---|---|---|
| File-based Blocking Threads | 17.8 | 94 | 189.9 | Baseline for recovery |
| File-based Virtual Threads | N/A | N/A | N/A | |
| Part-based Blocking Threads | 17.8 | 85.5 | 208.23 | 10% over file-based |
| Part-based Virtual Threads | 17.8 | 70 | 255.17 | 34% over file-based, 22% over blocking multipart |
The virtual-threaded multipart approach is roughly 34% faster than the file-based approach and ~22% faster than the blocking multipart approach.
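As an illustrative sketch of the part-based model benchmarked above (all names here are hypothetical; this is not the actual OpenSearch download API), each part of a remote file can be fetched on its own virtual thread via `Executors.newVirtualThreadPerTaskExecutor()` and reassembled in order (requires Java 21):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultipartDownloadSketch {
    // Stands in for a ranged blob-store read of [offset, offset + length);
    // simulated here with an in-memory copy so the sketch is self-contained.
    static byte[] fetchPart(byte[] source, int offset, int length) {
        byte[] part = new byte[length];
        System.arraycopy(source, offset, part, 0, length);
        return part;
    }

    // Fetches all parts concurrently, one virtual thread per part, and
    // reassembles them in their original order.
    static byte[] download(byte[] source, int partSize) throws Exception {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (int offset = 0; offset < source.length; offset += partSize) {
                final int off = offset;
                final int len = Math.min(partSize, source.length - off);
                futures.add(executor.submit(() -> fetchPart(source, off, len)));
            }
            byte[] result = new byte[source.length];
            int pos = 0;
            for (Future<byte[]> f : futures) {
                byte[] part = f.get();
                System.arraycopy(part, 0, result, pos, part.length);
                pos += part.length;
            }
            return result;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] blob = new byte[10_000];
        for (int i = 0; i < blob.length; i++) blob[i] = (byte) i;
        byte[] copy = download(blob, 1024);
        System.out.println(java.util.Arrays.equals(blob, copy)); // true
    }
}
```

In a real download, the blocking calls inside `fetchPart` would be network reads; each blocked virtual thread unmounts from its carrier, which is presumably where the throughput gains above come from.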
The setup used the `nyc_taxis` index with 6 primary shards of roughly 18 GB each and 0 replicas initially, plus a new `so` index with 3 primary shards and 1 replica spread across the 3 nodes. Recovery was kicked off by increasing the replica count of `nyc_taxis` to 2 while simultaneously indexing documents into `so` to check for overall replication lag. `_cat/segment_replication` was polled every 5 seconds while the recovery was in progress and the results were dumped to a file. To test for system fairness, the recovery speed (`indices.recovery.max_bytes_per_sec`) was limited to 125 MB/s.
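The measurement loop above can be sketched generically; the `fetch` supplier stands in for a `GET _cat/segment_replication` call, and all names are illustrative rather than the actual benchmark tooling:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class ReplicationLagPoller {
    // Polls `fetch` (standing in for a _cat/segment_replication request)
    // every intervalMs until `recoveryInProgress` turns false, collecting
    // each sample so it can later be dumped to a file.
    static List<String> poll(Supplier<String> fetch,
                             BooleanSupplier recoveryInProgress,
                             long intervalMs) throws InterruptedException {
        List<String> samples = new ArrayList<>();
        while (recoveryInProgress.getAsBoolean()) {
            samples.add(fetch.get());
            Thread.sleep(intervalMs); // 5000 ms in the experiment above
        }
        return samples;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a recovery that finishes after three polls.
        int[] remaining = {3};
        List<String> samples = poll(
            () -> "lag sample",
            () -> remaining[0]-- > 0,
            1);
        System.out.println(samples.size()); // 3
    }
}
```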
Case 1: 1s Refresh Interval for so
| Approach | Average Lag (seconds) | Average Lag (GB) | Max Lag (seconds) | Max Lag (GB) |
|---|---|---|---|---|
| Baseline | 1.9 | 0.27 | 42 | 4.6 |
| File-based Blocking Threads | 45 | 1.4 | 132 | 4.8 |
| File-based Virtual Threads | 55 | 1.7 | 150 | 5.8 |
| Part-based Blocking Threads | 15 | 0.9 | 108 | 5.2 |
| Part-based Virtual Threads | 16 | 0.5 | 44 | 1.4 |
Case 2: 10s Refresh Interval for so
| Approach | Average Lag (seconds) | Average Lag (GB) | Max Lag (seconds) | Max Lag (GB) |
|---|---|---|---|---|
| File-based Blocking Threads | 61.84873 | 2.0458 | 198 | 6.9 |
| File-based Virtual Threads | 71.4141 | 2.11445 | 246 | 6.3 |
| Part-based Blocking Threads | 51.16281 | 1.14736 | 168 | 3.1 |
| Part-based Virtual Threads | 41.63 | 1.44 | 126 | 4.8 |
Compared to the file-based approaches, the part-based virtual-threaded approach gives better performance. The inherent fairness added by the part-based design helps reduce the overall replication lag and improves replication performance. The baseline case has been added for comparison purposes; the virtual-threaded multipart approach is the fairest of the approaches tested.
The next steps for multipart download support would be as follows:
TLDR:
- With Security Manager: https://github.com/opensearch-project/OpenSearch/compare/main...kotwanikunal:OpenSearch:virtual-thread-sm
- Without Security Manager: https://github.com/opensearch-project/OpenSearch/compare/main...kotwanikunal:OpenSearch:virtual-thread
cc: @Bukhtawar @andrross @mch2
@kotwanikunal thanks a lot for exploring this area, there have been discussions related to virtual threads for a while now (both in OpenSearch and Apache Lucene communities), at the moment there are a few cautionary cases that we may look into more closely:
- file system I/O (basically accessing files on disk) is not virtual thread friendly yet, at least on some operating systems; https://openjdk.org/jeps/444 warns about that explicitly:
  > The vast majority of blocking operations in the JDK will unmount the virtual thread, freeing its carrier and the underlying OS thread to take on new work. However, some blocking operations in the JDK do not unmount the virtual thread, and thus block both its carrier and the underlying OS thread. This is because of limitations at either the OS level (e.g., many filesystem operations) or the JDK level ...
- as you mentioned, Security Manager is a bummer (https://github.com/opensearch-project/OpenSearch/issues/1687); the approaches to pass through or disable it are (in my opinion) overly simplistic - they completely defeat the security boundaries. Since the issue is not easy to solve at large, we may consider "sealing" the usage of virtual threads in core to some "trusted" code flows, making sure there is no way to hijack it with "non-trusted" ones (aka plugins)
- as a side note, we would very likely need to reconsider a large part of the core related to resource tracking, hot threads reporting, pool configurations, ... I am not 100% sure but I believe we may have surprises here.
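The unmounting behavior described in the JEP 444 quote above can be seen directly (illustrative sketch, Java 21): thousands of virtual threads blocked in `Thread.sleep` finish in roughly the duration of a single sleep, because sleeping unmounts each thread from its carrier; pinned filesystem operations would not scale this way.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UnmountDemo {
    public static void main(String[] args) throws Exception {
        Instant start = Instant.now();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    // Thread.sleep unmounts the virtual thread, freeing its
                    // carrier. Per JEP 444, many filesystem operations do NOT
                    // unmount and instead pin the carrier OS thread.
                    Thread.sleep(100);
                    return null;
                });
            }
        } // close() waits for all submitted tasks to finish
        long elapsedMs = Duration.between(start, Instant.now()).toMillis();
        // 10,000 x 100 ms of total sleep completes in far less than 10,000 s
        // of wall time, because the sleeps run concurrently on few carriers.
        System.out.println(elapsedMs < 5_000); // true
    }
}
```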
> @kotwanikunal thanks a lot for exploring this area, there have been discussions related to virtual threads for a while now (both in OpenSearch and Apache Lucene communities), at the moment there are a few cautionary cases that we may look into more closely:
> - file system I/O (basically accessing files on disk) is not virtual thread friendly yet, at least on some operating systems, https://openjdk.org/jeps/444 warns about that explicitly

Nice to hear from you @reta! :) Given that we have gains even with the platform-threaded approach, and there will be improvements over time with virtual threads, would it make more sense to head in this direction vs. a reactive/async programming model? Virtual threads will still be opt-in, possibly behind a feature flag, to allow some bake time.

> - as a side note, we would very likely need to reconsider a large part of the core related to resource tracking, hot threads reporting, pool configurations, ... I am not 100% sure but I believe we may have surprises here.

For the approach we are looking at, the queue is not unbounded and we will limit job submissions at the recovery/replication event level. We can build guardrails and a framework around these sets of operations.
> Nice to hear from you @reta! :)

❤️ Same, @kotwanikunal!

> Given that we have gains even with the platform-threaded approach, and there will be improvements over time with virtual threads, would it make more sense to head in this direction vs. a reactive/async programming model?

I think for downloading segments (network I/O) virtual threads are a perfect fit; for filesystem I/O we may need to weigh the tradeoff. As far as I understand, the experimental implementation does not indicate where exactly the gains come from, just that in general there are some. If we could collect precise measurements here, it would help.

> For the approach we are looking at, the queue is not unbounded and we will limit job submissions at the recovery/replication event level. We can build guardrails and a framework around these sets of operations.

My apologies, guardrails are certainly needed, but I meant a different subject here: we need to have visibility into virtual threads regardless of where they are being used. Can we track the CPU share used by a virtual thread (as we do for regular ones)? Can we capture stacks to collect hot threads? And things like that ...
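The observability concern is concrete: JEP 444 notes that `ThreadMXBean` only supports the monitoring and management of platform threads, so any hot-threads-style reporting built on it will simply not see virtual threads. A minimal demonstration (Java 21, illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;

public class VisibilityDemo {
    public static void main(String[] args) throws Exception {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Start a virtual thread and keep it alive while we inspect the MXBean.
        Thread vt = Thread.ofVirtual().start(() -> {
            started.countDown();
            try { release.await(); } catch (InterruptedException ignored) {}
        });
        started.await();

        // ThreadMXBean enumerates only platform threads, so the running
        // virtual thread's ID is absent from getAllThreadIds().
        boolean visible = Arrays.stream(mx.getAllThreadIds())
                .anyMatch(id -> id == vt.threadId());
        System.out.println(visible); // false

        release.countDown();
        vt.join();
    }
}
```

Stack capture is still possible per-thread via `Thread.getStackTrace()` or the newer thread-dump mechanisms, but the MXBean-based accounting (CPU time per thread, contention stats) does not apply to virtual threads.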
Is your feature request related to a problem? Please describe
Coming in from #11461 - we need to investigate utilizing virtual threads as the default I/O mechanism for segment downloads.
Describe the solution you'd like
#10786 talks about using virtual threads as the I/O mechanism over the current code base.
As a part of parallel downloads -
A major benefit of the virtual thread mechanism is that the existing code constructs and APIs do not need to change - it will leverage the benefits at a logically lower level.
The aim of this issue is to test and compare the overall effort, performance benefits between the two mechanisms as well as documenting the suggested path forward.
Related component
Search:Remote Search
Describe alternatives you've considered
Additional context
#11461