Closed prrao87 closed 1 year ago
I should add: although these benchmarks were run on an M2 Mac, I've also tried running similar code for indexing both Meilisearch and Elasticsearch on regular Intel CPU machines, and I think there are a host of issues with the way green threads are available on various systems. macOS arm64 architectures seem to have by far the most "free" threads (several thousand) at a given time, but other systems don't. So combining a multi-process + async approach doesn't work as well on Intel-based Linux systems.
Based on my experience in real settings, I feel like just using async alone is so much easier to develop, test and maintain. Curious to hear how your experience has been, @sanders41!
I tried running cProfile but got some errors running the script and haven't had a chance to figure out why. My guess is that the overhead of starting the processes is negating the performance benefit, bringing it in line with the sync client. This is probably even more true with the speedups from Pydantic 2.
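If running cProfile from the command line against the script keeps erroring out, one alternative worth trying is to profile a single callable in-process. The sketch below is only an illustration: `profile_call` and `fake_indexing_job` are hypothetical names, and the stand-in job is not the benchmark code from this PR.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    # runcall profiles exactly one call and hands back its return value.
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def fake_indexing_job(n):
    # Hypothetical stand-in for the real bulk-index entry point.
    return sum(i * i for i in range(n))


result, report = profile_call(fake_indexing_job, 10_000)
print(report)
```

Swapping `fake_indexing_job` for the function that starts the worker processes should show whether process start-up dominates, though cProfile only sees the parent process, so time spent inside the children would not appear in the report.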
I agree with you that async alone is easier. What I do usually is reach for multiprocessing once performance becomes an issue. Once a dataset gets that big the overhead usually becomes worth it.
I wonder if you would see any difference by using add_documents_in_batches (available in both the sync and async clients) instead of doing the chunking yourself. It looks like roughly the same idea so might not make a difference, but it may be worth a try just to see.
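For context, "doing the chunking yourself" usually amounts to something like the generator below. This is a generic sketch, not the code from the benchmark scripts; `chunked` and the sample documents are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls the next `size` items without copying the whole input.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical documents; real ones would come from the dataset being indexed.
docs = [{"id": i} for i in range(7)]
batches = list(chunked(docs, 3))
```

Each list in `batches` would then be passed to a separate add or update call, which is the part the client's batch helpers take over.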
I tried update_documents_in_batches as suggested, and indeed, it's faster! Clearly, it helps to avoid doing pure-Python operations wherever possible. I also used tqdm.asyncio to see async batch progress.
python bulk_index_sync.py -b 1
Finished updating database index settings
100%|██████████████████████████████| 1/1 [00:01<00:00, 1.56s/it]
Bulk index took 2.2505 seconds
python bulk_index_async.py -b 1
Finished updating database index settings
100%|██████████████████████████████| 1/1 [00:00<00:00, 1.25it/s]
Finished running benchmarks
Bulk index took 1.5184 seconds
python bulk_index_sync.py -b 10
Finished updating database index settings
100%|████████████████████████████| 10/10 [00:15<00:00, 1.54s/it]
Bulk index took 16.1129 seconds
python bulk_index_async.py -b 10
Finished updating database index settings
100%|████████████████████████████| 10/10 [00:09<00:00, 1.10it/s]
Finished running benchmarks
Bulk index took 9.7602 seconds
python bulk_index_sync.py -b 100
Finished updating database index settings
100%|██████████████████████████| 100/100 [02:43<00:00, 1.64s/it]
Bulk index took 164.3521 seconds
There were memory issues when ingesting batches of 10k at a time for this many runs. As a result, smaller batches were supplied. The run time when using a smaller batch size of 2k is slightly longer than when using a batch size of 5k, which makes sense.
python bulk_index_async.py -b 100 --chunksize 5000
Finished updating database index settings
100%|███████████████████████████████████| 100/100 [01:32<00:00, 1.08it/s]
Finished running benchmarks
Bulk index took 93.1814 seconds
python bulk_index_async.py -b 100 --chunksize 2000
Finished updating database index settings
100%|████████████████████████| 100/100 [01:40<00:00, 1.01s/it]
Finished running benchmarks
Bulk index took 101.4677 seconds
In general, it's recommended to optimize the batch size to be as large as permissible when using update_documents_in_batches.
I decided against going forward with the multiproc + async approach due to unnecessary complexity. The code is harder to reason about, and not very readable. The async method works great! As does uploading the documents in batches (assuming the batches are of a reasonably small size such that Rust's multi-threaded execution can happen smoothly in the background without running out of memory). I'll have to tune this more carefully in a real world case! Thanks for the tips @sanders41 👊🏽
Do you think the Rust memory issue was caused by how meilisearch-python-async was sending things, or was it purely on the Meilisearch side?
If you think it was related to the client feel free to open an issue with your findings and I'll see if I can come up with a solution.
@sanders41 I actually don't know what the cause of the memory issue could be -- you may be right; the error could be with how the async client is sending the batches concurrently to the backend. There's a gigantic stack trace whose error messages make no sense to me when I run the async version with a batch size of 10k. Do you think I should file an issue? I'm not sure if I can create an MRE with any other random data.
python bulk_index_async.py -b 100 --chunksize 10000
Only a part of the stack trace is shown below.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
response = await http_method(path, json=body, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
return await self.request(
^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
response = await self._send_handling_auth(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
response = await self._send_handling_redirects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
response = await self._send_single_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
response = await transport.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
with map_httpcore_exceptions():
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.WriteError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
return await gather(*batches)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
response = await self._http_requests.put(url, documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
return await self._send_request(self.http_client.put, path, body, content_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
raise MeilisearchError(str(err)) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py:75> exception=MeilisearchError('')>
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/asyncio/selector_events.py", line 1082, in _write_ready
n = self._sock.send(self._buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
return await self._stream.receive(max_bytes=max_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
raise self._protocol.exception
anyio.BrokenResourceError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
resp = await self._pool.handle_async_request(req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 262, in handle_async_request
raise exc
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 245, in handle_async_request
response = await connection.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection.py", line 96, in handle_async_request
return await self._connection.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 121, in handle_async_request
raise exc
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 99, in handle_async_request
) = await self._receive_response_headers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 164, in _receive_response_headers
event = await self._receive_event(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 200, in _receive_event
data = await self._network_stream.read(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
with map_exceptions(exc_map):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ReadError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
response = await http_method(path, json=body, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
return await self.request(
^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
response = await self._send_handling_auth(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
response = await self._send_handling_redirects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
response = await self._send_single_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
response = await transport.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
with map_httpcore_exceptions():
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
return await gather(*batches)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
response = await self._http_requests.put(url, documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
return await self._send_request(self.http_client.put, path, body, content_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
raise MeilisearchError(str(err)) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py:75> exception=MeilisearchError('')>
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/asyncio/selector_events.py", line 1082, in _write_ready
n = self._sock.send(self._buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
return await self._stream.receive(max_bytes=max_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
raise self._protocol.exception
anyio.BrokenResourceError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
resp = await self._pool.handle_async_request(req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 262, in handle_async_request
raise exc
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 245, in handle_async_request
response = await connection.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection.py", line 96, in handle_async_request
return await self._connection.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 121, in handle_async_request
raise exc
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 99, in handle_async_request
) = await self._receive_response_headers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 164, in _receive_response_headers
event = await self._receive_event(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 200, in _receive_event
data = await self._network_stream.read(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
with map_exceptions(exc_map):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ReadError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
response = await http_method(path, json=body, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
return await self.request(
^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
response = await self._send_handling_auth(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
response = await self._send_handling_redirects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
response = await self._send_single_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
response = await transport.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
with map_httpcore_exceptions():
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
return await gather(*batches)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
response = await self._http_requests.put(url, documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
return await self._send_request(self.http_client.put, path, body, content_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
raise MeilisearchError(str(err)) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py:75> exception=MeilisearchError('')>
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/asyncio/selector_events.py", line 1082, in _write_ready
n = self._sock.send(self._buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
return await self._stream.receive(max_bytes=max_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
raise self._protocol.exception
anyio.BrokenResourceError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
resp = await self._pool.handle_async_request(req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 262, in handle_async_request
raise exc
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 245, in handle_async_request
response = await connection.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection.py", line 96, in handle_async_request
return await self._connection.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 121, in handle_async_request
raise exc
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 99, in handle_async_request
) = await self._receive_response_headers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 164, in _receive_response_headers
event = await self._receive_event(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 200, in _receive_event
data = await self._network_stream.read(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
with map_exceptions(exc_map):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ReadError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
response = await http_method(path, json=body, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
return await self.request(
^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
response = await self._send_handling_auth(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
response = await self._send_handling_redirects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
response = await self._send_single_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
response = await transport.handle_async_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
with map_httpcore_exceptions():
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
return await gather(*batches)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
response = await self._http_requests.put(url, documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
return await self._send_request(self.http_client.put, path, body, content_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
raise MeilisearchError(str(err)) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
My initial hunch was that the larger the batch size, the more data that's concurrently being processed by Rust on the Meilisearch side, and considering it's all multi-threaded + async Rust, there's only so many processes that can be handled with the given amount of memory. It could explain why there are absolutely no issues with smaller batch sizes.
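If the amount of data in flight at once is the constraint, one client-side experiment would be to cap how many batches are sent concurrently. The traceback shows the batches going through `gather(*batches)`; the sketch below wraps each send in an `asyncio.Semaphore`. It is an assumption-laden illustration, not how meilisearch-python-async works: `send_bounded`, `fake_send`, and `max_in_flight` are hypothetical names, and `fake_send` stands in for the real HTTP call.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def send_bounded(
    batches: Sequence[list],
    send_batch: Callable[[list], Awaitable[object]],
    max_in_flight: int = 4,
) -> List[object]:
    """Send every batch, keeping at most `max_in_flight` requests open."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(batch: list) -> object:
        # Wait for a free slot before starting this batch's request.
        async with semaphore:
            return await send_batch(batch)

    return await asyncio.gather(*(guarded(b) for b in batches))


async def demo() -> tuple:
    in_flight = 0
    peak = 0

    async def fake_send(batch: list) -> int:
        # Hypothetical stand-in for an HTTP call to the search backend.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return len(batch)

    batches = [[{"id": i}] * 5 for i in range(10)]
    results = await send_bounded(batches, fake_send, max_in_flight=3)
    return results, peak


results, peak = asyncio.run(demo())
```

Whether this would avoid the `No buffer space available` error is untested here; it only shows how the number of simultaneous sends could be made a tunable alongside the chunk size.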
The stack trace does make it look like it's on the Meilisearch side. My first thought was that maybe it was sending too many requests at once, but a larger batch size would actually send fewer requests. You are probably correct in your hunch.
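The point about request counts is simple arithmetic: for a fixed number of documents, the number of requests is the document count divided by the batch size, rounded up. The dataset size below is a made-up figure for illustration; only the batch sizes come from this thread.

```python
import math


def request_count(total_docs: int, batch_size: int) -> int:
    """Number of requests needed to send total_docs in batches of batch_size."""
    return math.ceil(total_docs / batch_size)


# Hypothetical dataset size; the batch sizes are the ones tried in this thread.
total_docs = 100_000
for batch_size in (2_000, 5_000, 10_000):
    print(batch_size, request_count(total_docs, batch_size))
```

So the 10k runs issue the fewest requests, but each request body is the largest, which is consistent with the problem being about payload size rather than request volume.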
Will you go ahead and open an issue? You can just reference this discussion, no need for an MRE. I may not be able to do anything about the issue itself, but I'm wondering if there is a way I could use exception groups in Python 3.11+ to give a better error message without slowing things down and making it messy trying to figure out which Python version is being used.
Really, I'm not sure if exception groups could give a better message or not; I haven't used them yet since in all my libraries I'm supporting 3.8+. I figure it's at least worth looking into, though.
Goals
The aim of this PR is to run a benchmark of a bulk index scenario in Meilisearch, using one of three methods: sync, async, and async + multiproc, per @sanders41 in #15.
Three scripts are created for this purpose, included in this PR. The meilisearch official (sync) Python client is also included in requirements.txt.
Results
Case 1: 1 run
Sync
Bulk index took 2.2454 seconds
Async
Bulk index took 1.5590 seconds
Multiprocessor async
Bulk index took 3.5335 seconds
Case 2: 10 runs
Sync
Bulk index took 22.3101 seconds
Async
Bulk index took 16.0275 seconds
Multiprocessor async
Bulk index took 22.5052 seconds
Case 3: 100 runs
Sync
Bulk index took 231.9254 seconds
Async
Bulk index took 165.9993 seconds
Multiprocessor async
Bulk index took 228.9810 seconds
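To make the comparison easier to read, the timings above can be turned into speedups relative to the sync client. The numbers are copied from the results; the dictionary layout and the `speedup_over_sync` helper are just one hypothetical way to organise them.

```python
# Timings in seconds, copied from the results above.
timings = {
    "1 run": {"sync": 2.2454, "async": 1.5590, "multiproc_async": 3.5335},
    "10 runs": {"sync": 22.3101, "async": 16.0275, "multiproc_async": 22.5052},
    "100 runs": {"sync": 231.9254, "async": 165.9993, "multiproc_async": 228.9810},
}


def speedup_over_sync(case: dict) -> dict:
    """Return sync_time / method_time for each non-sync method."""
    baseline = case["sync"]
    return {name: baseline / t for name, t in case.items() if name != "sync"}


speedups = {label: speedup_over_sync(case) for label, case in timings.items()}
for label, ratios in speedups.items():
    print(label, {name: round(r, 2) for name, r in ratios.items()})
```

A ratio above 1 means the method is faster than sync. Async comes out at roughly 1.4x in all three cases, while the multiprocessor async variant stays close to or below 1.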
Observations
The meilisearch-python-async client's async coroutines outperform the sync version on all counts, and this difference will become more and more prominent as the dataset gets larger and larger (no question that async is very valuable in this scenario).
(aiofiles, as discussed, has its own issues with consistency across different OSes and file systems)
@sanders41 Do you think there's anything fundamentally off about the code in the async + multiprocessing case? I'm a bit surprised at the fact that the performance is comparable to the sync meilisearch Python client -- at the very least, I'd have expected the performance to be in between the sync and async versions, closer to async than sync.
The way these numbers look, I'm tempted to just remove the code you originally suggested in #15 altogether -- the async version is super intuitive, blazing fast, and is extremely readable (thus maintainable in a real setting).
I'd love to hear your thoughts when you've had the chance to take a quick stab at it. Hopefully, the code and benchmark should be reproducible enough on your system the way it's been written. Cheers!