opensearch-project / opensearch-py

Python Client for OpenSearch
https://opensearch.org/docs/latest/clients/python/
Apache License 2.0

[BUG] msearch hangs when dealing with a high number of records. #517

Closed manugarri closed 1 year ago

manugarri commented 1 year ago

What is the bug?

I'm running a search job on a big batch file (900K records), so I'm using multisearch. The cluster has 3 data nodes and 3 master nodes.

I split the records into batches. The weird thing is that if I run batches of 5,000 records, the job takes around 200 seconds to process. AWS metrics show no apparent issue with memory or CPU on any of the nodes.

However, if I use 10,000 records per msearch call, something strange happens.

For a while the cluster performs the search operations: I can see active and queued threads on the thread pool API endpoint /_cat/thread_pool/search. However, after a certain point there are no more active, queued, or rejected threads in the thread pool, but the Python msearch call just hangs, and it hangs forever. I have to kill the Jupyter kernel to recover.
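The thread pool check described above can be scripted. The following is a minimal sketch, assuming the endpoint is queried with `?format=json` so that each row is a dict whose `active` and `queue` columns arrive as strings. The function name and the sample rows are hypothetical, not real cluster output.

```python
def search_pool_is_idle(rows):
    """Return True when no node reports active or queued search threads.

    `rows` is assumed to be the parsed JSON from
    GET /_cat/thread_pool/search?format=json, where numeric
    columns are strings.
    """
    return all(int(row["active"]) == 0 and int(row["queue"]) == 0 for row in rows)


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"node_name": "data-1", "name": "search", "active": "0", "queue": "0", "rejected": "0"},
    {"node_name": "data-2", "name": "search", "active": "3", "queue": "12", "rejected": "0"},
]

print(search_pool_is_idle(sample_rows))      # second node is still busy
print(search_pool_is_idle(sample_rows[:1]))  # only the idle node
```

If the pool is idle while the client call is still blocked, the wait is likely happening somewhere between the cluster and the client rather than in the search itself.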

How can one reproduce the bug?

I can't share the data I'm using, unfortunately, and the data used for the search is correlated with the number of records that makes the search hang.

But in a nutshell, running this

msearch_result = search_client.msearch(
    msearch_query,
)

with a high volume of records makes the job crash, not on the OpenSearch side, but on the Python client side.

It is important to note that the records query any of the 50 or so indices we have, so not all searches in the msearch call go to the same index.
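For reference, here is a minimal sketch of how such a mixed-index msearch body might be built and split into batches. The msearch body is newline-delimited JSON: one header line naming the index, then one line with the query, and a trailing newline at the end. The helper names, index names, and queries are hypothetical, since the original data cannot be shared.

```python
import json


def build_msearch_body(searches):
    """Serialize (index, query) pairs into a newline-delimited msearch body."""
    lines = []
    for index, query in searches:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(query))             # query line
    return "\n".join(lines) + "\n"                  # body must end with a newline


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical searches spread over different indices.
searches = [
    ("index-a", {"query": {"match_all": {}}}),
    ("index-b", {"query": {"term": {"id": 7}}}),
    ("index-a", {"query": {"term": {"id": 9}}}),
]

bodies = [build_msearch_body(batch) for batch in batched(searches, 2)]
```

Each element of `bodies` can then be sent as one msearch request, so the batch size controls how long a single request runs.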

However, using the requests library directly (with the aws-auth library for authentication) works perfectly.

# This works with no problem
resp = requests.post(
    'https://' + endpoint + '/_msearch',
    data=msearch_query,
    headers={'Content-Type': 'application/json'},
    timeout=500,
)

What is the expected behavior?

The Python client should handle the request or, if the response body from the multisearch operation is too big, raise an appropriate exception.

What is your host/environment?

opensearchpy 2.2.0

OS: ProductName: macOS ProductVersion: 14.0 BuildVersion: 23A344

manugarri commented 1 year ago

UPDATE: I realised that the issue still happens when using the requests library. I'm not sure why an msearch request would hang when the cluster is done with the actual search, but it is not an issue with this library.

In fact, sometimes the query succeeds but the response body is '{\n "message": "Request Timeout",\n}'. Curiously, the only queries that fail are those that take more than 300 seconds, which means this is probably related to some network timeout setting that I can't seem to find.
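Since that timeout body can come back in place of real results, one option is to validate the response text before using it. This is a minimal sketch, assuming a successful msearch response is a JSON object with a `responses` list. The exception class and function name are hypothetical. Note that the timeout body quoted above has a trailing comma, so it is not even valid JSON and fails at the parsing step.

```python
import json


class MsearchResponseError(Exception):
    """Raised when a response body does not look like msearch results."""


def parse_msearch_response(text):
    """Return the list under "responses", or raise MsearchResponseError."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise MsearchResponseError(f"body is not valid JSON: {text[:200]!r}") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("responses"), list):
        raise MsearchResponseError(f"body has no 'responses' list: {text[:200]!r}")
    return payload["responses"]


# The timeout body quoted in this comment, reproduced verbatim.
timeout_body = '{\n "message": "Request Timeout",\n}'

try:
    parse_msearch_response(timeout_body)
except MsearchResponseError as err:
    print("rejected:", err)
```

With the requests call shown earlier, this would be applied to `resp.text`, so a timeout message surfaces as an exception instead of being mistaken for search results.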