Open navneet1v opened 1 year ago
Hey, what are your cluster's index settings? I would expect this with default cluster settings, as shards in an index are thread-bound as well. A default index will have 3 shards, and that's the maximum number of threads that can be used for ingestion.
Are you spawning threads and creating instances of the client? Can you please post a repro? Is your actual usage of the client sync or async?
I do think that a client should be able to push the server into a state where it starts returning 429s.
@dblock It's the same client which is getting used across the threads.
@dtaivpp, I checked the cluster and the CPU utilization was not going above 60% even at the maximum thread count. That is why I think the problem is that the client is not able to push more requests.
I also tried changing `pool_maxsize` from 10 to 20, but even then there was no increase in the number of requests that could be sent to the cluster.
Are you using a sync client or an async client? My guess is that you're using a sync client.
I wrote a benchmark that compares synchronous, threaded, and asynchronous I/O in opensearch-py. I was not able to push the synchronous nor the threaded version of the client past a certain limit, no matter the `pool_maxsize` or other settings I tried. The client does use a connection pool, but it's ~~probably just too busy doing other work on a single thread blocked by the GIL, so spawning more threads doesn't help~~.
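The shape of that benchmark can be sketched with the stdlib alone. Everything below is illustrative: a sleep stands in for the network call, where a real run would invoke the opensearch-py client (sync or async) instead of `fake_request`.

```python
# Compare sequential, threaded, and asyncio execution of the same number of
# simulated requests. The sleep is a stand-in for a blocking client call.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N = 20
DELAY = 0.01  # pretend each request takes 10ms of I/O


def fake_request():
    time.sleep(DELAY)  # stand-in for a blocking client call


async def fake_request_async():
    await asyncio.sleep(DELAY)  # stand-in for an awaited client call


def bench_sync():
    start = time.perf_counter()
    for _ in range(N):
        fake_request()
    return time.perf_counter() - start


def bench_threaded(workers=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: fake_request(), range(N)))
    return time.perf_counter() - start


def bench_async():
    async def run():
        await asyncio.gather(*(fake_request_async() for _ in range(N)))

    start = time.perf_counter()
    asyncio.run(run())
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sync: {bench_sync():.3f}s, "
          f"threaded: {bench_threaded():.3f}s, "
          f"async: {bench_async():.3f}s")
```

With a pure sleep, threads and asyncio both overlap the waits; the interesting question in this issue is why the real client does not scale the same way.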
Sync vs. threaded inserts make no difference. For 5x50 inserts async outperforms sync 1.4x.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
Sync vs. threaded. | 1.387 | 1.604 | 1.450 | 1.377 (1.0x) | 1.424 (1.1x) | 1.399 (1.0x) |
Sync vs. async. | 1.312 | 1.479 | 1.400 | 0.960 (1.4x) | 1.047 (1.4x) | 0.994 (1.4x) |
5x250 items, 2.5x improvement
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
Sync vs. async. | 4.426 | 4.994 | 4.694 | 1.798 (2.5x) | 1.992 (2.5x) | 1.894 (2.5x) |
5x1000 items, 3.5x improvement
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
Sync vs. async. | 16.629 | 17.493 | 16.999 | 4.863 (3.4x) | 5.025 (3.5x) | 4.980 (3.4x) |
We may be able to improve the synchronous client performance on insert, but I don't (yet) know how.
I have a (not very scientific) flame chart I recorded of a synchronous client doing some search requests with SigV4. Obviously these numbers are very dependent on the specific request, etc., but this was done on large machines, so hardware shouldn't be a bottleneck.
Here's the sync/async benchmark profile, it's actually stuck doing synchronous I/O.
This is strange; why is it stuck in synchronous I/O? My understanding was that I/O should be done in an async fashion, even with a sync client.
Unfortunately not, and it's by design. The default `Urllib3HttpConnection` uses urllib3, and the requests are synchronous. This is why `aiohttp` was introduced, but it requires your code to become async to take advantage of it.
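A minimal sketch of what the async alternative looks like, assuming opensearch-py is installed with the async extra (`pip install opensearch-py[async]`) and that a cluster is reachable at the placeholder endpoint; it will not run without one.

```python
# Issue several requests concurrently with the aiohttp-based async client.
# Host/port are placeholders; a real setup would also configure auth/TLS.
import asyncio
from opensearchpy import AsyncOpenSearch


async def main():
    client = AsyncOpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    try:
        # All ten info() calls are in flight at once instead of one
        # blocking call at a time.
        results = await asyncio.gather(*(client.info() for _ in range(10)))
        print(len(results))
    finally:
        await client.close()


if __name__ == "__main__":
    asyncio.run(main())
```

The trade-off discussed in this thread is exactly this: the calling code must itself become async to benefit.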
@dblock can we use some other implementation of `HttpConnection`? Generally we use `RequestsHttpConnection`; does that also have the same problem?
@dblock here is my thinking: as a user of the OpenSearch client I feel that the OpenSearch sync client is easy to use. Most of the services have the same thing. But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy.
> @dblock can we use some other implementation of `HttpConnection`? Generally we use `RequestsHttpConnection`; does that also have the same problem?
Yes, it does. That one does allow changing `pool_maxsize`, but that has little effect on throughput.
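For reference, a sketch of what raising the pool size looks like on the sync client; the host and the size are placeholders, and whether it actually helps throughput is exactly what is being debated here.

```python
# Construct a sync client with RequestsHttpConnection and a larger
# connection pool. Requires a reachable cluster to actually use.
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],  # placeholder endpoint
    connection_class=RequestsHttpConnection,
    pool_maxsize=32,  # commonly matched to the number of worker threads
)
```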
> But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy.
What are other open-source client examples that talk HTTP that don't exhibit this kind of problem?
PS: I did find https://number1.co.za/how-to-speed-up-http-calls-in-python-with-examples/ that claims that requests-futures can get close to async performance, which could be an idea for a reimplementation, but it ain't a small effort.
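The requests-futures approach essentially wraps blocking calls in a thread pool and hands back futures. A stdlib-only sketch of the idea, with a stub standing in for a real `client.search` call; `FutureClient` and `search_fn` are illustrative names, not part of any library.

```python
# Wrap a blocking call in a thread pool and return a Future immediately,
# which is the core of the requests-futures pattern.
from concurrent.futures import Future, ThreadPoolExecutor


class FutureClient:
    def __init__(self, search_fn, max_workers=8):
        # search_fn is a stand-in for something like client.search
        self._search_fn = search_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def search(self, *args, **kwargs) -> Future:
        # Returns immediately; call .result() when the response is needed.
        return self._pool.submit(self._search_fn, *args, **kwargs)


# Usage with a stub in place of a real client call:
fc = FutureClient(lambda q: {"hits": q.upper()})
futures = [fc.search(q) for q in ("a", "b", "c")]
print([f.result() for f in futures])  # → [{'hits': 'A'}, {'hits': 'B'}, {'hits': 'C'}]
```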
@dblock can we check Milvus?
Seems to be using grpc, whole other ball game.
Let me check whether there are any other clients I can find that are not gRPC-based. This seems to be a standard thing, and a fix should exist.
Can you run multiple Python processes with multiple clients? And there is also bulk, which helps max out the connection.
@wbeckler No, we don't want to run multiple clients, as that defeats the purpose of what we are trying to benchmark.
Also, the requests we are trying to send are queries, not indexing. Hence bulk is not an option.
Thanks @navneet1v. I do think the problem is clear, we want the client to not block on I/O across requests.
Yes that is correct.
@dblock @wbeckler is anything required from my side to prioritize this issue?
Is it okay to use python multiprocessing instead of multithreading in your use case? https://docs.python.org/3/library/multiprocessing.html
@wbeckler we can use that, but the sync client is easy to use and is already integrated into various ANN benchmarks, which is why we want to keep things simple.
I'm suggesting that the benchmarking or high-volume query application rely on multiprocessing, and that way the python client would not be blocking on IO. So instead of multithread: {do python client stuff}, you would have multiprocess: {do python client stuff}
I am not a python expert, so I might be way off here, but that's how it seems it should work from my naive perspective.
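A stdlib-only sketch of the multiprocessing suggestion, with a stub in place of the real client call. The key point is that each process would create its own `OpenSearch` client inside the worker rather than inherit one; the `fork` start method is assumed (Linux/macOS).

```python
# Fan queries out to separate processes so no connection pool or GIL is
# shared. worker() is a stand-in: a real one would build an OpenSearch
# client *inside* the process and call client.search(...).
from multiprocessing import get_context


def worker(query_id, out_queue):
    # Real version: client = OpenSearch(...); out_queue.put(client.search(...))
    out_queue.put({"query_id": query_id, "status": "ok"})


def run_all(num_queries=4):
    ctx = get_context("fork")  # fork keeps this sketch self-contained
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, queue)) for i in range(num_queries)]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full pipe.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(results, key=lambda r: r["query_id"])


if __name__ == "__main__":
    print(run_all())
```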
I am going to take another look at this, assigning to myself.
Updated benchmarks in https://github.com/dblock/opensearch-py/tree/benchmarks/samples/benchmarks.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
1 client vs. more clients (async) | 1.076 | 1.076 | 1.076 | 0.787 (1.4x) | 0.787 (1.4x) | 0.787 (1.4x) |
1 thread vs. 2 threads (sync) | 6.013 | 6.013 | 6.013 | 3.629 (1.7x) | 3.629 (1.7x) | 3.629 (1.7x) |
1 thread vs. 8 threads (sync) | 6.031 | 6.031 | 6.031 | 1.186 (5.1x) | 1.186 (5.1x) | 1.186 (5.1x) |
1 thread vs. 32 threads (sync) | 5.777 | 5.777 | 5.777 | 1.058 (5.5x) | 1.058 (5.5x) | 1.058 (5.5x) |
We are getting an improvement by using threads (1.7x with 2, 5.1x with 8, with diminishing returns further, 5.5x with 32).
If we set the `pool_maxsize` to the number of threads, this gets improved up to a certain amount.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
1 client vs. more clients (async) | 1.401 | 1.401 | 1.401 | 0.788 (1.8x) | 0.788 (1.8x) | 0.788 (1.8x) |
1 thread vs. 2 threads (sync) | 7.355 | 7.355 | 7.355 | 2.835 (2.6x) | 2.835 (2.6x) | 2.835 (2.6x) |
1 thread vs. 4 threads (sync) | 6.616 | 6.616 | 6.616 | 1.777 (3.7x) | 1.777 (3.7x) | 1.777 (3.7x) |
1 thread vs. 8 threads (sync) | 6.340 | 6.340 | 6.340 | 1.272 (5.0x) | 1.272 (5.0x) | 1.272 (5.0x) |
1 thread vs. 32 threads (sync) | 6.608 | 6.608 | 6.608 | 1.076 (6.1x) | 1.076 (6.1x) | 1.076 (6.1x) |
Sync vs. async. Async really just depends on the pool size (3 iterations).
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
async (1 client) vs. sync (8 threads) | 3.560 | 3.560 | 3.560 | 3.210 (1.1x) | 3.210 (1.1x) | 3.210 (1.1x) |
async (8 clients) vs. sync (8 threads) | 3.526 | 3.526 | 3.526 | 2.342 (1.5x) | 2.342 (1.5x) | 2.342 (1.5x) |
I've spent a lot of time on the synchronous client, having written multiple benchmarks and digging through the code to understand how pooling works. I've varied the number of threads, the number of clients, search and ingestion, data sizes, and switched between the default `Urllib3HttpConnection` and the available `RequestsHttpConnection` connection classes.

The simplest benchmark is to do `info()` in a loop 100 times. The client improves throughput 5x and 3x with the two connection classes when using more threads and setting the max pool size to the number of threads.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
1 thread vs. 32 threads (sync), `RequestsHttpConnection` | 30.104 | 30.104 | 30.104 | 5.697 (5.3x) | 5.697 (5.3x) | 5.697 (5.3x) |
1 thread vs. 32 threads (sync), `Urllib3HttpConnection` | 9.646 | 9.646 | 9.646 | 2.748 (3.5x) | 2.748 (3.5x) | 2.748 (3.5x) |
This does show that `RequestsHttpConnection` is a lot slower. @navneet1v mentions we should remove it from being the default recommended in the documentation, which seems right.
Overall, these benchmarks are pretty conclusive that the client does not have the limitation as described and we need to dig deeper into how it's being used to see where the bottleneck is.
@dblock Thanks for doing the deep-dive. Most of the time as a user we go by seeing the examples. If changing the connection client is the only problem we should just fix the examples and our documentation.
Also, can we mark `RequestsHttpConnection` as deprecated, and document that using this connection class can degrade performance?
In https://github.com/opensearch-project/opensearch-py/pull/535 I documented the connection pools and made sure `pool_maxsize` works for both implementations (it doesn't improve benchmarks by much).

I don't think we should deprecate `RequestsHttpConnection`, even if it's slower. If an application fully depends on requests, you will not want to take a dependency on another HTTP library. Similarly, using async in this client is much faster than `Urllib3HttpConnection`, but we aren't deprecating the synchronous implementation.
Makes sense. My whole idea was that we should provide our recommendation for best performance, and also call out the challenges in using other connection libs: `RequestsHttpConnection` doesn't perform well, per our benchmarks.
It's great to see benchmarks for this client! @dblock is there any way we can make these benchmarks part of this repo or part of opensearch-benchmark?
https://github.com/opensearch-project/opensearch-py/pull/537
I renamed the issue to "improve performance in sync client".
@navneet1v I know you've been running some anecdotal benchmarks with k-nn. What are your most recent numbers between RequestsHttpConnection, Urllib3HttpConnection, multiple clients and async?
@dblock Here are the results.

`RequestsHttpConnection`, multithreading:

With Parsing
average latency: 108.16474408659936
p50 Latency(sec): 104.17016150003633
p90 Latency(sec): 156.73814179999113
p99 Latency(sec): 215.99599194000638
average took time latency: 12.9771
p50 took Latency(ms): 11.0
p90 took Latency(ms): 18.0
p99 took Latency(ms): 42.0
Without parsing logic
average latency: 55.61106889247895
p50 Latency(sec): 52.75726318359375
p90 Latency(sec): 76.60882472991945
p99 Latency(sec): 114.45076704025274
average took time latency: 25721.5
p50 took Latency(ms): 23.0
p90 took Latency(ms): 42.0
p99 took Latency(ms): 73.0
`RequestsHttpConnection`, multiprocessing:

With Parsing
average latency: 41.32260456085205
p50 Latency(ms): 39.499521255493164
p90 Latency(ms): 54.305076599121094
p99 Latency(ms): 85.85322380065924
p50 took Latency(ms): 16.0
p90 took Latency(ms): 28.0
p99 took Latency(ms): 49.0
Without parsing logic
average latency: 52.904314446449284
p50 Latency(ms): 50.8497953414917
p90 Latency(ms): 71.18616104125977
p99 Latency(ms): 100.52832365036012
p50 took Latency(ms): 22.0
p90 took Latency(ms): 39.0
p99 took Latency(ms): 63.0
`Urllib3HttpConnection`, multithreading:

With Parsing
average latency: 79.73963953840034
p50 Latency(sec): 77.58072550001316
p90 Latency(sec): 114.84195310003997
p99 Latency(sec): 156.2950457700498
average took time latency: 11.8679
p50 took Latency(ms): 11.0
p90 took Latency(ms): 14.0
p99 took Latency(ms): 29.0
Without parsing logic
average latency: 54.77132937908173
p50 Latency(sec): 51.97501182556152
p90 Latency(sec): 75.43172836303712
p99 Latency(sec): 111.9501113891602
average took time latency: 25799.7
p50 took Latency(ms): 23.0
p90 took Latency(ms): 41.0
p99 took Latency(ms): 74.0
`Urllib3HttpConnection`, multiprocessing:

With Parsing
average latency: 42.12491910457611
p50 Latency(ms): 40.169358253479004
p90 Latency(ms): 56.438875198364265
p99 Latency(ms): 84.30349588394165
p50 took Latency(ms): 16.0
p90 took Latency(ms): 30.0
p99 took Latency(ms): 51.01000000000022
Without parsing logic
average latency: 53.499785208702086
p50 Latency(ms): 50.940632820129395
p90 Latency(ms): 72.27611541748048
p99 Latency(ms): 106.08061790466309
p50 took Latency(ms): 23.0
p90 took Latency(ms): 40.0
p99 took Latency(ms): 66.01000000000022
Thanks, this is helpful as a comparison of multithreading vs. multiprocessing, and thank you for the conclusions; I agree with them. Here's a more readable version.

Metric | requests, threads | requests, processes | urllib3, threads | urllib3, processes |
---|---|---|---|---|
**with parsing** | | | | |
average | 108.16 | 41.32 | 79.74 | 42.12 |
p50 | 104.17 | 39.50 | 77.58 | 40.17 |
p90 | 156.74 | 54.31 | 114.84 | 56.44 |
p99 | 216.00 | 85.85 | 156.30 | 84.30 |
average took | 12.98 | | 11.87 | |
p50 took | 11.00 | 16.00 | 11.00 | 16.00 |
p90 took | 18.00 | 28.00 | 14.00 | 30.00 |
p99 took | 42.00 | 49.00 | 29.00 | 51.01 |
**without parsing** | | | | |
average | 55.61 | 52.90 | 54.77 | 53.50 |
p50 | 52.76 | 50.85 | 51.98 | 50.94 |
p90 | 76.61 | 71.19 | 75.43 | 72.28 |
p99 | 114.45 | 100.53 | 111.95 | 106.08 |
average took | 25.72 | | 25.80 | |
p50 took | 23.00 | 22.00 | 23.00 | 23.00 |
p90 took | 42.00 | 39.00 | 41.00 | 40.00 |
p99 took | 73.00 | 63.00 | 74.00 | 66.01 |
tl;dr: with urllib3 we're seeing a 30% improvement over where this issue started.

We can conclude that urllib3 is 14% better than requests in pure transport. If we are doing more work, urllib3 performs 30% better than requests (likely because it has less blocking behavior). Finally, because of the additional processing, multithreading is roughly 2x slower than multiprocessing with urllib3.
We need to move to proper benchmarking from here I think, with the goal of reducing the 156ms p99 in multithreading above closer to the 84.30ms multiprocessing.
We have 3 kinds of requests possible:

1. `GET` without content (e.g. `info()`)
2. `POST` with a large payload sent and a small payload returned (e.g. `bulk()`)
3. `POST` with a small payload sent and a large payload returned (e.g. `search()`)
It's clear from @navneet1v's numbers that the blocking nature of the client has an outsized impact on overall processing. So apart from pure client transport benchmarking along the 3 variations above, we will need to add processing to simulate realistic scenarios.
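A sketch of what "transport plus processing" could look like, with a sleep standing in for the network round trip and JSON decoding as the CPU-bound parsing step; all names, delays, and payload sizes are illustrative. Threads overlap the simulated I/O, but the parsing portion serializes on the GIL, which is the effect discussed in this issue.

```python
# Each simulated request is I/O (a sleep, where the client call would go)
# followed by CPU-bound "parsing" of a response payload.
import json
import time
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = json.dumps({"hits": [{"_id": i, "_score": 1.0} for i in range(2000)]})


def request_with_parsing():
    time.sleep(0.005)          # stand-in for the network round trip
    doc = json.loads(PAYLOAD)  # CPU-bound response parsing
    return len(doc["hits"])


def run(num_threads, total=40):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        start = time.perf_counter()
        results = list(pool.map(lambda _: request_with_parsing(), range(total)))
        return time.perf_counter() - start, results


if __name__ == "__main__":
    t1, _ = run(1)
    t8, _ = run(8)
    print(f"1 thread: {t1:.3f}s, 8 threads: {t8:.3f}s")
```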
What is the bug?
I am using opensearch-py client version 2.2.0 to run searches against a cluster. I started using multi-threading to push more and more requests to the cluster, but I see that after 4 threads we are not able to increase the throughput. I see no throttling happening from the server, but the client is not able to push more requests.
How can one reproduce the bug?
Create an OpenSearch cluster and use the opensearch-py client with multi-threading to start running queries. We tested with 1, 2, 4, 8, 16, 32, 48, and 96 threads.
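A minimal sketch of this repro; the endpoint, index name, and query are placeholders, and it needs a reachable cluster to run, so no output is shown.

```python
# Share one sync client across a growing number of threads and measure
# query throughput at each thread count.
import time
from concurrent.futures import ThreadPoolExecutor
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder
QUERY = {"query": {"match_all": {}}}  # placeholder query


def one_query(_):
    return client.search(index="my-index", body=QUERY)


for num_threads in (1, 2, 4, 8, 16, 32, 48, 96):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(one_query, range(200)))
    elapsed = time.perf_counter() - start
    print(f"{num_threads:3d} threads: {200 / elapsed:.1f} queries/sec")
```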
What is the expected behavior?
The expectation is that the OpenSearch client should be able to push through more and more requests, and if the OpenSearch cluster is not able to take them, we should get a throttling exception.
What is your host/environment?
Linux.
Do you have any screenshots?
NA
Do you have any additional context?
I tried increasing the worker pool size from 10 to 20, but that didn't help.