Open navneet1v opened 1 year ago
Hey, what are your cluster's index settings? I would expect this with default cluster settings, as shards in an index are thread-bound as well. A default index will have 3 shards, and that's the maximum number of threads that can be used for ingestion.
Are you spawning threads and creating instances of the client? Can you please post a repro? Is your actual usage of the client sync or async?
I do think that a client should be able to push the server into a state where it starts returning 429s.
@dblock It's the same client which is getting used across the threads.
@dtaivpp, I checked the cluster and the CPU utilization was not going above 60% even at the maximum thread count. That is why I think the problem is that the client is not able to push more requests.
I also tried changing `pool_maxsize` from 10 to 20, but even then there was no increase in the number of requests that could be sent to the cluster.
Are you using a sync client or an async client? My guess is that you're using a sync client.
I wrote a benchmark that compares synchronous, threaded, and asynchronous I/O in opensearch-py. I was not able to push the synchronous nor the threaded version of the client past a certain limit, no matter the `pool_maxsize` or other settings I tried. The client does use a connection pool, but it's ~~probably just too busy doing other work on a single thread blocked by the GIL, so spawning more threads doesn't help~~.
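The shape of that benchmark can be sketched with the stdlib alone. Everything below is illustrative: a sleep stands in for the network call, where a real run would invoke the opensearch-py client (sync or async) instead of `fake_request`.

```python
# Compare sequential, threaded, and asyncio execution of the same number of
# simulated requests. The sleep is a stand-in for a blocking client call.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N = 20
DELAY = 0.01  # pretend each request takes 10ms of I/O


def fake_request():
    time.sleep(DELAY)  # stand-in for a blocking client call


async def fake_request_async():
    await asyncio.sleep(DELAY)  # stand-in for an awaited client call


def bench_sync():
    start = time.perf_counter()
    for _ in range(N):
        fake_request()
    return time.perf_counter() - start


def bench_threaded(workers=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: fake_request(), range(N)))
    return time.perf_counter() - start


def bench_async():
    async def run():
        await asyncio.gather(*(fake_request_async() for _ in range(N)))

    start = time.perf_counter()
    asyncio.run(run())
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sync: {bench_sync():.3f}s, "
          f"threaded: {bench_threaded():.3f}s, "
          f"async: {bench_async():.3f}s")
```

With a pure sleep, threads and asyncio both overlap the waits; the interesting question in this issue is why the real client does not scale the same way.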
Sync vs. threaded inserts make no difference. For 5x50 inserts async outperforms sync 1.4x.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
Sync vs. threaded. | 1.387 | 1.604 | 1.450 | 1.377 (1.0x) | 1.424 (1.1x) | 1.399 (1.0x) |
Sync vs. async. | 1.312 | 1.479 | 1.400 | 0.960 (1.4x) | 1.047 (1.4x) | 0.994 (1.4x) |
5x250 items, 2.5x improvement
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
Sync vs. async. | 4.426 | 4.994 | 4.694 | 1.798 (2.5x) | 1.992 (2.5x) | 1.894 (2.5x) |
5x1000 items, 3.5x improvement
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
Sync vs. async. | 16.629 | 17.493 | 16.999 | 4.863 (3.4x) | 5.025 (3.5x) | 4.980 (3.4x) |
We may be able to improve the synchronous client performance on insert, but I don't (yet) know how.
I have a (not very scientific) flame chart I recorded of a synchronous client doing some search requests with SigV4. Obviously these numbers are very dependent on the specific request, etc., but this was done on large machines, so hardware shouldn't be a bottleneck.
Here's the sync/async benchmark profile, it's actually stuck doing synchronous I/O.
This is strange; why is it stuck in synchronous I/O? My understanding was that I/O should be done in an async fashion, even with a sync client.
Unfortunately not, and it's by design. The default `Urllib3HttpConnection` uses urllib3, and the requests are synchronous. This is why `aiohttp` was introduced, but it requires your code to become async to take advantage of it.
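A minimal sketch of what the async alternative looks like, assuming opensearch-py is installed with the async extra (`pip install opensearch-py[async]`) and that a cluster is reachable at the placeholder endpoint; it will not run without one.

```python
# Issue several requests concurrently with the aiohttp-based async client.
# Host/port are placeholders; a real setup would also configure auth/TLS.
import asyncio
from opensearchpy import AsyncOpenSearch


async def main():
    client = AsyncOpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    try:
        # All ten info() calls are in flight at once instead of one
        # blocking call at a time.
        results = await asyncio.gather(*(client.info() for _ in range(10)))
        print(len(results))
    finally:
        await client.close()


if __name__ == "__main__":
    asyncio.run(main())
```

The trade-off discussed in this thread is exactly this: the calling code must itself become async to benefit.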
@dblock can we use some other implementation of `HttpConnection`? Generally we use `RequestsHttpConnection`; does that also have the same problem?
@dblock here is my thinking: as a user of the OpenSearch client I feel that the OpenSearch sync client is easy to use. Most of the services have the same thing. But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy.
> @dblock can we use some other implementation of `HttpConnection`? Generally we use `RequestsHttpConnection`; does that also have the same problem?
Yes, it does. That one does allow changing `pool_maxsize`, but that has little effect on throughput.
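For reference, a sketch of what raising the pool size looks like on the sync client; the host and the size are placeholders, and whether it actually helps throughput is exactly what is being debated here.

```python
# Construct a sync client with RequestsHttpConnection and a larger
# connection pool. Requires a reachable cluster to actually use.
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],  # placeholder endpoint
    connection_class=RequestsHttpConnection,
    pool_maxsize=32,  # commonly matched to the number of worker threads
)
```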
> But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy.
What are other open-source client examples that talk HTTP that don't exhibit this kind of problem?
PS: I did find https://number1.co.za/how-to-speed-up-http-calls-in-python-with-examples/ that claims that requests-futures can get close to async performance, which could be an idea for a reimplementation, but it ain't a small effort.
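The requests-futures approach essentially wraps blocking calls in a thread pool and hands back futures. A stdlib-only sketch of the idea, with a stub standing in for a real `client.search` call; `FutureClient` and `search_fn` are illustrative names, not part of any library.

```python
# Wrap a blocking call in a thread pool and return a Future immediately,
# which is the core of the requests-futures pattern.
from concurrent.futures import Future, ThreadPoolExecutor


class FutureClient:
    def __init__(self, search_fn, max_workers=8):
        # search_fn is a stand-in for something like client.search
        self._search_fn = search_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def search(self, *args, **kwargs) -> Future:
        # Returns immediately; call .result() when the response is needed.
        return self._pool.submit(self._search_fn, *args, **kwargs)


# Usage with a stub in place of a real client call:
fc = FutureClient(lambda q: {"hits": q.upper()})
futures = [fc.search(q) for q in ("a", "b", "c")]
print([f.result() for f in futures])  # → [{'hits': 'A'}, {'hits': 'B'}, {'hits': 'C'}]
```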
@dblock can we check Milvus?
Seems to be using grpc, whole other ball game.
Let me check whether there are any other clients I can find that are not gRPC-based. This seems to be a standard thing, and a fix should exist.
Can you run multiple Python processes with multiple clients? And there is also bulk, which helps max out the connection.
@wbeckler No, we don't want to run multiple clients, as that defeats the purpose of what we are trying to benchmark.
Also, the requests we are trying to send are queries, not indexing. Hence bulk is not an option.
Thanks @navneet1v. I do think the problem is clear, we want the client to not block on I/O across requests.
Yes that is correct.
@dblock @wbeckler is anything required from my side to prioritize this issue?
Is it okay to use python multiprocessing instead of multithreading in your use case? https://docs.python.org/3/library/multiprocessing.html
@wbeckler we can use that, but the sync client is easy to use and is already integrated into various ANN benchmarks, which is why we want to keep things simple.
I'm suggesting that the benchmarking or high-volume query application rely on multiprocessing, and that way the python client would not be blocking on IO. So instead of multithread: {do python client stuff}, you would have multiprocess: {do python client stuff}
I am not a python expert, so I might be way off here, but that's how it seems it should work from my naive perspective.
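A stdlib-only sketch of the multiprocessing suggestion, with a stub in place of the real client call. The key point is that each process would create its own `OpenSearch` client inside the worker rather than inherit one; the `fork` start method is assumed (Linux/macOS).

```python
# Fan queries out to separate processes so no connection pool or GIL is
# shared. worker() is a stand-in: a real one would build an OpenSearch
# client *inside* the process and call client.search(...).
from multiprocessing import get_context


def worker(query_id, out_queue):
    # Real version: client = OpenSearch(...); out_queue.put(client.search(...))
    out_queue.put({"query_id": query_id, "status": "ok"})


def run_all(num_queries=4):
    ctx = get_context("fork")  # fork keeps this sketch self-contained
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, queue)) for i in range(num_queries)]
    for p in procs:
        p.start()
    # Drain the queue before joining to avoid blocking on a full pipe.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(results, key=lambda r: r["query_id"])


if __name__ == "__main__":
    print(run_all())
```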
I am going to take another look at this, assigning to myself.
Updated benchmarks in https://github.com/dblock/opensearch-py/tree/benchmarks/samples/benchmarks.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
1 client vs. more clients (async) | 1.076 | 1.076 | 1.076 | 0.787 (1.4x) | 0.787 (1.4x) | 0.787 (1.4x) |
1 thread vs. 2 threads (sync) | 6.013 | 6.013 | 6.013 | 3.629 (1.7x) | 3.629 (1.7x) | 3.629 (1.7x) |
1 thread vs. 8 threads (sync) | 6.031 | 6.031 | 6.031 | 1.186 (5.1x) | 1.186 (5.1x) | 1.186 (5.1x) |
1 thread vs. 32 threads (sync) | 5.777 | 5.777 | 5.777 | 1.058 (5.5x) | 1.058 (5.5x) | 1.058 (5.5x) |
We are getting an improvement by using threads (1.7x with 2, 5.1x with 8, with diminishing returns further, 5.5x with 32).
If we set the `pool_maxsize` to the number of threads, this gets improved up to a certain amount.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
1 client vs. more clients (async) | 1.401 | 1.401 | 1.401 | 0.788 (1.8x) | 0.788 (1.8x) | 0.788 (1.8x) |
1 thread vs. 2 threads (sync) | 7.355 | 7.355 | 7.355 | 2.835 (2.6x) | 2.835 (2.6x) | 2.835 (2.6x) |
1 thread vs. 4 threads (sync) | 6.616 | 6.616 | 6.616 | 1.777 (3.7x) | 1.777 (3.7x) | 1.777 (3.7x) |
1 thread vs. 8 threads (sync) | 6.340 | 6.340 | 6.340 | 1.272 (5.0x) | 1.272 (5.0x) | 1.272 (5.0x) |
1 thread vs. 32 threads (sync) | 6.608 | 6.608 | 6.608 | 1.076 (6.1x) | 1.076 (6.1x) | 1.076 (6.1x) |
Sync vs. async. Async really just depends on the pool size (3 iterations).
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
async (1 client) vs. sync (8 threads) | 3.560 | 3.560 | 3.560 | 3.210 (1.1x) | 3.210 (1.1x) | 3.210 (1.1x) |
async (8 clients) vs. sync (8 threads) | 3.526 | 3.526 | 3.526 | 2.342 (1.5x) | 2.342 (1.5x) | 2.342 (1.5x) |
I've spent a lot of time on the synchronous client, having written multiple benchmarks and digging through the code to understand how pooling works. I've varied the number of threads, the number of clients, search and ingestion, data sizes, and switched between the default `Urllib3HttpConnection` and the available `RequestsHttpConnection` connection classes.

The simplest benchmark is to do `info()` in a loop 100 times. The client improves throughput 5x and 3x with the two connection classes when using more threads and setting the max pool size to the number of threads.
Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
---|---|---|---|---|---|---|
1 thread vs. 32 threads (sync), `RequestsHttpConnection` | 30.104 | 30.104 | 30.104 | 5.697 (5.3x) | 5.697 (5.3x) | 5.697 (5.3x) |
1 thread vs. 32 threads (sync), `Urllib3HttpConnection` | 9.646 | 9.646 | 9.646 | 2.748 (3.5x) | 2.748 (3.5x) | 2.748 (3.5x) |
This does show that `RequestsHttpConnection` is a lot slower. @navneet1v mentions we should remove it from being the default recommended in the documentation, which seems right.
Overall, these benchmarks are pretty conclusive that the client does not have the limitation as described and we need to dig deeper into how it's being used to see where the bottleneck is.
@dblock Thanks for doing the deep-dive. Most of the time as a user we go by seeing the examples. If changing the connection client is the only problem we should just fix the examples and our documentation.
Also, can we mark `RequestsHttpConnection` as deprecated, and document that using this connection class can degrade performance?
In https://github.com/opensearch-project/opensearch-py/pull/535 I documented the connection pools and made sure `pool_maxsize` works for both implementations (it doesn't improve benchmarks by much).

I don't think we should deprecate `RequestsHttpConnection`, even if it's slower. If an application fully depends on requests, you will not want to take a dependency on another HTTP library. Similarly, using async in this client is much faster than `Urllib3HttpConnection`, but we aren't deprecating the synchronous implementation.
Makes sense. My whole idea was that we should provide our recommendation for best performance, and also call out the challenges in using other connection libs: `RequestsHttpConnection` doesn't perform well, per our benchmarks.
It's great to see benchmarks for this client! @dblock is there any way we can make these benchmarks part of this repo or part of opensearch-benchmark?
https://github.com/opensearch-project/opensearch-py/pull/537
I renamed the issue to "improve performance in sync client".
@navneet1v I know you've been running some anecdotal benchmarks with k-nn. What are your most recent numbers between RequestsHttpConnection, Urllib3HttpConnection, multiple clients and async?
@dblock Here are the results.

`RequestsHttpConnection`, multithreading:

With Parsing
average latency: 108.16474408659936
p50 Latency(sec): 104.17016150003633
p90 Latency(sec): 156.73814179999113
p99 Latency(sec): 215.99599194000638
average took time latency: 12.9771
p50 took Latency(ms): 11.0
p90 took Latency(ms): 18.0
p99 took Latency(ms): 42.0
Without parsing logic
average latency: 55.61106889247895
p50 Latency(sec): 52.75726318359375
p90 Latency(sec): 76.60882472991945
p99 Latency(sec): 114.45076704025274
average took time latency: 25721.5
p50 took Latency(ms): 23.0
p90 took Latency(ms): 42.0
p99 took Latency(ms): 73.0
`RequestsHttpConnection`, multiprocessing:

With Parsing
average latency: 41.32260456085205
p50 Latency(ms): 39.499521255493164
p90 Latency(ms): 54.305076599121094
p99 Latency(ms): 85.85322380065924
p50 took Latency(ms): 16.0
p90 took Latency(ms): 28.0
p99 took Latency(ms): 49.0
Without parsing logic
average latency: 52.904314446449284
p50 Latency(ms): 50.8497953414917
p90 Latency(ms): 71.18616104125977
p99 Latency(ms): 100.52832365036012
p50 took Latency(ms): 22.0
p90 took Latency(ms): 39.0
p99 took Latency(ms): 63.0
`Urllib3HttpConnection`, multithreading:

With Parsing
average latency: 79.73963953840034
p50 Latency(sec): 77.58072550001316
p90 Latency(sec): 114.84195310003997
p99 Latency(sec): 156.2950457700498
average took time latency: 11.8679
p50 took Latency(ms): 11.0
p90 took Latency(ms): 14.0
p99 took Latency(ms): 29.0
Without parsing logic
average latency: 54.77132937908173
p50 Latency(sec): 51.97501182556152
p90 Latency(sec): 75.43172836303712
p99 Latency(sec): 111.9501113891602
average took time latency: 25799.7
p50 took Latency(ms): 23.0
p90 took Latency(ms): 41.0
p99 took Latency(ms): 74.0
`Urllib3HttpConnection`, multiprocessing:

With Parsing
average latency: 42.12491910457611
p50 Latency(ms): 40.169358253479004
p90 Latency(ms): 56.438875198364265
p99 Latency(ms): 84.30349588394165
p50 took Latency(ms): 16.0
p90 took Latency(ms): 30.0
p99 took Latency(ms): 51.01000000000022
Without parsing logic
average latency: 53.499785208702086
p50 Latency(ms): 50.940632820129395
p90 Latency(ms): 72.27611541748048
p99 Latency(ms): 106.08061790466309
p50 took Latency(ms): 23.0
p90 took Latency(ms): 40.0
p99 took Latency(ms): 66.01000000000022
Thanks, this is helpful as a comparison of multithreading vs. multiprocessing, and thank you for the conclusions; I agree with them. Here's a more readable version.

Metric | requests, threads | requests, processes | urllib3, threads | urllib3, processes |
---|---|---|---|---|
**with parsing** | | | | |
average | 108.16 | 41.32 | 79.74 | 42.12 |
p50 | 104.17 | 39.50 | 77.58 | 40.17 |
p90 | 156.74 | 54.31 | 114.84 | 56.44 |
p99 | 216.00 | 85.85 | 156.30 | 84.30 |
average took | 12.98 | | 11.87 | |
p50 took | 11.00 | 16.00 | 11.00 | 16.00 |
p90 took | 18.00 | 28.00 | 14.00 | 30.00 |
p99 took | 42.00 | 49.00 | 29.00 | 51.01 |
**without parsing** | | | | |
average | 55.61 | 52.90 | 54.77 | 53.50 |
p50 | 52.76 | 50.85 | 51.98 | 50.94 |
p90 | 76.61 | 71.19 | 75.43 | 72.28 |
p99 | 114.45 | 100.53 | 111.95 | 106.08 |
average took | 25.72 | | 25.80 | |
p50 took | 23.00 | 22.00 | 23.00 | 23.00 |
p90 took | 42.00 | 39.00 | 41.00 | 40.00 |
p99 took | 73.00 | 63.00 | 74.00 | 66.01 |
tl;dr: with urllib3 we're seeing a 30% improvement over where this issue started.

We can conclude that urllib3 is 14% better than requests in pure transport. If we are doing more work, urllib3 performs 30% better than requests (likely because it has less blocking behavior). Finally, because of the additional processing, multithreading is roughly 2x slower than multiprocessing with urllib3.
We need to move to proper benchmarking from here I think, with the goal of reducing the 156ms p99 in multithreading above closer to the 84.30ms multiprocessing.
We have 3 kinds of requests possible:

1. `GET` without content (e.g. `info()`)
2. `POST` with a large payload sent and a small payload returned (e.g. `bulk()`)
3. `POST` with a small payload sent and a large payload returned (e.g. `search()`)
It's clear from @navneet1v's numbers that the blocking nature of the client has an outsized impact on overall processing. So apart from pure client transport benchmarking along the 3 variations above, we will need to add processing to simulate realistic scenarios.
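A sketch of what "transport plus processing" could look like, with a sleep standing in for the network round trip and JSON decoding as the CPU-bound parsing step; all names, delays, and payload sizes are illustrative. Threads overlap the simulated I/O, but the parsing portion serializes on the GIL, which is the effect discussed in this issue.

```python
# Each simulated request is I/O (a sleep, where the client call would go)
# followed by CPU-bound "parsing" of a response payload.
import json
import time
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = json.dumps({"hits": [{"_id": i, "_score": 1.0} for i in range(2000)]})


def request_with_parsing():
    time.sleep(0.005)          # stand-in for the network round trip
    doc = json.loads(PAYLOAD)  # CPU-bound response parsing
    return len(doc["hits"])


def run(num_threads, total=40):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        start = time.perf_counter()
        results = list(pool.map(lambda _: request_with_parsing(), range(total)))
        return time.perf_counter() - start, results


if __name__ == "__main__":
    t1, _ = run(1)
    t8, _ = run(8)
    print(f"1 thread: {t1:.3f}s, 8 threads: {t8:.3f}s")
```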
What is the bug?
I am using opensearch-py client version 2.2.0 to run searches against a cluster. I started using multi-threading to push more and more requests to the cluster, but I see that after 4 threads we are not able to increase the throughput. I see no throttling happening from the server, but the client is not able to push more requests.
How can one reproduce the bug?
Create an OpenSearch cluster and use the opensearch-py client with multi-threading to start running queries. We tested with 1, 2, 4, 8, 16, 32, 48, and 96 threads.
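A minimal sketch of this repro; the endpoint, index name, and query are placeholders, and it needs a reachable cluster to run, so no output is shown.

```python
# Share one sync client across a growing number of threads and measure
# query throughput at each thread count.
import time
from concurrent.futures import ThreadPoolExecutor
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder
QUERY = {"query": {"match_all": {}}}  # placeholder query


def one_query(_):
    return client.search(index="my-index", body=QUERY)


for num_threads in (1, 2, 4, 8, 16, 32, 48, 96):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(one_query, range(200)))
    elapsed = time.perf_counter() - start
    print(f"{num_threads:3d} threads: {200 / elapsed:.1f} queries/sec")
```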
What is the expected behavior?
The expectation is that the OpenSearch client should be able to push through more and more requests, and if the OpenSearch cluster is not able to take them, we should get a throttling exception.
What is your host/environment?
Linux.
Do you have any screenshots?
NA
Do you have any additional context?
I tried increasing the worker pool size from 10 to 20, but that didn't help.