prrao87 / db-hub-fastapi

Async bulk data ingestion and querying in various document, graph and vector databases via their Python clients
MIT License

Meilisearch bulk index benchmark #41

Closed prrao87 closed 1 year ago

prrao87 commented 1 year ago

Goals

The aim of this PR is to run a benchmark of a bulk index scenario in Meilisearch, using one of three methods: sync, async, and async + multiproc, per @sanders41 in #15.

Three scripts are created for this purpose, included in this PR. The meilisearch official (sync) python client is also included in requirements.txt.

Results

cd scripts

Case 1: 1 run

Sync

python bulk_index_sync.py -b 1

Bulk index took 2.2454 seconds

Async

python bulk_index_async.py -b 1

Bulk index took 1.5590 seconds

Multiprocessor async

python bulk_index_multiproc_async.py -b 1

Bulk index took 3.5335 seconds


Case 2: 10 runs

Sync

python bulk_index_sync.py -b 10

Bulk index took 22.3101 seconds

Async

python bulk_index_async.py -b 10

Bulk index took 16.0275 seconds

Multiprocessor async

python bulk_index_multiproc_async.py -b 10

Bulk index took 22.5052 seconds


Case 3: 100 runs

Sync

python bulk_index_sync.py -b 100

Bulk index took 231.9254 seconds

Async

python bulk_index_async.py -b 100

Bulk index took 165.9993 seconds

Multiprocessor async

python bulk_index_multiproc_async.py -b 100

Bulk index took 228.9810 seconds


Observations

@sanders41 Do you think there's anything fundamentally off about the code in the async + multiprocessing case? I'm a bit surprised at the fact that the performance is comparable to the sync meilisearch Python client -- at the very least, I'd have expected the performance to be in between the sync and async versions, closer to async than sync.

The way these numbers look, I'm tempted to just remove the code you originally suggested in #15 altogether -- the async version is super intuitive, blazing fast, and is extremely readable (thus maintainable in a real setting).

I'd love to hear your thoughts when you've had the chance to take a quick stab at it. Hopefully, the code and benchmark should be reproducible enough on your system the way it's been written. Cheers!
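To make the comparison concrete, the timings reported above can be normalized against the async run. The sketch below only restates the numbers from this comment. The dictionary layout and the helper name `relative_to_async` are illustrative and are not part of the benchmark scripts.

```python
# Wall-clock seconds copied from the results above, keyed by number of runs (-b).
timings = {
    1: {"sync": 2.2454, "async": 1.5590, "multiproc_async": 3.5335},
    10: {"sync": 22.3101, "async": 16.0275, "multiproc_async": 22.5052},
    100: {"sync": 231.9254, "async": 165.9993, "multiproc_async": 228.9810},
}


def relative_to_async(runs: int) -> dict[str, float]:
    """Return each method's time as a multiple of the async time for `runs`."""
    base = timings[runs]["async"]
    return {name: seconds / base for name, seconds in timings[runs].items()}


for runs in timings:
    ratios = relative_to_async(runs)
    summary = ", ".join(f"{name}: {ratio:.2f}x" for name, ratio in ratios.items())
    print(f"{runs:>3} run(s) -> {summary}")
```

In the 100-run case, both the sync and the multiprocessing variants take roughly 1.4 times as long as the async version, which is the gap the question above is about.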

prrao87 commented 1 year ago

I should add: although these benchmarks were run on an M2 Mac, I've also tried running similar code for indexing both Meilisearch and Elasticsearch on regular Intel CPU machines, and I think there are a host of issues with the way green threads are available on various systems. macOS arm64 architectures seem to have by far the most "free" threads (several thousand) at any given time, but other systems don't. So combining a multi-process + async approach doesn't work as well on Intel-based Linux systems.

Based on my experience in real settings, I feel like just using async alone is so much easier to develop, test and maintain. Curious to hear how your experience has been, @sanders41!

sanders41 commented 1 year ago

I tried running a cProfile but got some errors running the script and haven't had a chance to figure out why. My guess is that the overhead of starting the processes is negating the performance benefit, bringing it in line with the sync client. This is probably even more true with the speedups from Pydantic 2.

I agree with you that async alone is easier. What I do usually is reach for multiprocessing once performance becomes an issue. Once a dataset gets that big the overhead usually becomes worth it.

I wonder if you would see any difference by using add_documents_in_batches (available in both the sync and async clients) instead of doing the chunking yourself. It looks like roughly the same idea so might not make a difference, but it may be worth a try just to see.
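For readers unfamiliar with "doing the chunking yourself", a minimal sketch of manual chunking follows. It is not the code from this PR, nor the client's implementation of add_documents_in_batches. The function name `chunk_documents` and the sample documents are hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunk_documents(documents: Sequence[T], chunksize: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `documents`, each with at most `chunksize` items."""
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    for start in range(0, len(documents), chunksize):
        yield documents[start:start + chunksize]


# Hypothetical documents; a real run would load these from the dataset.
docs = [{"id": i} for i in range(10)]
for batch in chunk_documents(docs, chunksize=4):
    print(len(batch), "documents in this batch")
```

Each yielded batch would then be handed to one indexing call, which is roughly what a built-in batching method does on the caller's behalf.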

prrao87 commented 1 year ago

Updates

Timing: 1 run

Sync

python bulk_index_sync.py -b 1
Finished updating database index settings
100%|██████████████████████████████| 1/1 [00:01<00:00,  1.56s/it]
Bulk index took 2.2505 seconds

Async

python bulk_index_async.py -b 1
Finished updating database index settings
100%|██████████████████████████████| 1/1 [00:00<00:00,  1.25it/s]
Finished running benchmarks
Bulk index took 1.5184 seconds

Timing: 10 runs

Sync

python bulk_index_sync.py -b 10
Finished updating database index settings
100%|████████████████████████████| 10/10 [00:15<00:00,  1.54s/it]
Bulk index took 16.1129 seconds

Async

python bulk_index_async.py -b 10
Finished updating database index settings
100%|████████████████████████████| 10/10 [00:09<00:00,  1.10it/s]
Finished running benchmarks
Bulk index took 9.7602 seconds

Timing: 100 runs

Sync

python bulk_index_sync.py -b 100
Finished updating database index settings
100%|██████████████████████████| 100/100 [02:43<00:00,  1.64s/it]
Bulk index took 164.3521 seconds

Async

There were memory issues when ingesting batches of 10k at a time for this many runs, so smaller batches were supplied instead. The run time with a batch size of 2k is slightly longer than with a batch size of 5k, which makes sense.

python bulk_index_async.py -b 100 --chunksize 5000
Finished updating database index settings
100%|███████████████████████████████████| 100/100 [01:32<00:00,  1.08it/s]
Finished running benchmarks
Bulk index took 93.1814 seconds

python bulk_index_async.py -b 100 --chunksize 2000
Finished updating database index settings
100%|████████████████████████| 100/100 [01:40<00:00,  1.01s/it]
Finished running benchmarks
Bulk index took 101.4677 seconds

In general, it's recommended to optimize the batch size to be as large as permissible when using update_documents_in_batches.

prrao87 commented 1 year ago

I decided against going forward with the multiproc + async approach due to unnecessary complexity. The code is harder to reason about, and not very readable. The async method works great! As does uploading the documents in batches (assuming the batches are of a reasonably small size such that Rust's multi-threaded execution can happen smoothly in the background without running out of memory). I'll have to tune this more carefully in a real world case! Thanks for the tips @sanders41 👊🏽

sanders41 commented 1 year ago

Do you think the Rust memory issue was caused by how meilisearch-python-async was sending things, or was it purely on the Meilisearch side?

If you think it was related to the client feel free to open an issue with your findings and I'll see if I can come up with a solution.

prrao87 commented 1 year ago

@sanders41 I actually don't know what the cause of the memory issue could be -- you may be right; the error could be with how the async client is sending the batches concurrently to the backend. There's a gigantic stack trace whose error messages make no sense to me when I run the async version with a batch size of 10k. Do you think I should file an issue? I'm not sure if I can create an MRE with any other random data.

python bulk_index_async.py -b 100 --chunksize 10000 

Only a part of the stack trace is shown below.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
    response = await http_method(path, json=body, headers=headers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.WriteError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
    return await gather(*batches)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
    response = await self._http_requests.put(url, documents)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
    return await self._send_request(self.http_client.put, path, body, content_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
    raise MeilisearchError(str(err))  # pragma: no cover
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py:75> exception=MeilisearchError('')>
Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/asyncio/selector_events.py", line 1082, in _write_ready
    n = self._sock.send(self._buffer)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
    return await self._stream.receive(max_bytes=max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 262, in handle_async_request
    raise exc
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 245, in handle_async_request
    response = await connection.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection.py", line 96, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 121, in handle_async_request
    raise exc
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 99, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 164, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 200, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
    with map_exceptions(exc_map):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
    response = await http_method(path, json=body, headers=headers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
    return await gather(*batches)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
    response = await self._http_requests.put(url, documents)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
    return await self._send_request(self.http_client.put, path, body, content_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
    raise MeilisearchError(str(err))  # pragma: no cover
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py:75> exception=MeilisearchError('')>
Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/asyncio/selector_events.py", line 1082, in _write_ready
    n = self._sock.send(self._buffer)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
    return await self._stream.receive(max_bytes=max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 262, in handle_async_request
    raise exc
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 245, in handle_async_request
    response = await connection.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection.py", line 96, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 121, in handle_async_request
    raise exc
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 99, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 164, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 200, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
    with map_exceptions(exc_map):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
    response = await http_method(path, json=body, headers=headers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
    return await gather(*batches)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
    response = await self._http_requests.put(url, documents)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
    return await self._send_request(self.http_client.put, path, body, content_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
    raise MeilisearchError(str(err))  # pragma: no cover
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py:75> exception=MeilisearchError('')>
Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/asyncio/selector_events.py", line 1082, in _write_ready
    n = self._sock.send(self._buffer)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
    return await self._stream.receive(max_bytes=max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 262, in handle_async_request
    raise exc
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 245, in handle_async_request
    response = await connection.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/connection.py", line 96, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 121, in handle_async_request
    raise exc
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 99, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 164, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_async/http11.py", line 200, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
    with map_exceptions(exc_map):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 43, in _send_request
    response = await http_method(path, json=body, headers=headers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1885, in put
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 948, in update_documents_in_batches
    return await gather(*batches)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/index.py", line 904, in update_documents
    response = await self._http_requests.put(url, documents)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 73, in put
    return await self._send_request(self.http_client.put, path, body, content_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/meilisearch_python_async/_http_requests.py", line 55, in _send_request
    raise MeilisearchError(str(err))  # pragma: no cover
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
meilisearch_python_async.errors.MeilisearchError: MeilisearchError. Error message: .

prrao87 commented 1 year ago

My initial hunch was that the larger the batch size, the more data is being processed concurrently by Rust on the Meilisearch side, and considering it's all multi-threaded + async Rust, there are only so many processes that can be handled with the given amount of memory. That could explain why there are absolutely no issues with smaller batch sizes.

sanders41 commented 1 year ago

The stack trace does make it look like it's on the Meilisearch side. My first thought was maybe it was sending too many requests at once, but a larger batch size would actually send fewer requests. You are probably correct in your hunch.
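If the number of simultaneous requests ever did turn out to matter, one generic way to cap it is an asyncio.Semaphore around each send. The sketch below is only an illustration under that assumption, not how meilisearch-python-async works. `send_batch` is a hypothetical stand-in that sleeps instead of doing network I/O, and `max_in_flight` is an invented parameter.

```python
import asyncio


async def send_batch(batch: list, state: dict) -> int:
    """Hypothetical stand-in for one upload request; records peak concurrency."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # simulate waiting on the network
    state["in_flight"] -= 1
    return len(batch)


async def send_bounded(batches: list, max_in_flight: int = 4):
    """Send all batches, allowing at most `max_in_flight` requests at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)
    state = {"in_flight": 0, "peak": 0}

    async def guarded(batch: list) -> int:
        # Only `max_in_flight` coroutines can hold the semaphore at once.
        async with semaphore:
            return await send_batch(batch, state)

    sizes = await asyncio.gather(*(guarded(b) for b in batches))
    return sizes, state["peak"]
```

Calling `asyncio.run(send_bounded(batches, max_in_flight=4))` returns the per-batch sizes together with the highest number of requests that were in flight at once, which should never exceed the chosen limit.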

Will you go ahead and open an issue? You can just reference this discussion, no need for an MRE. I may not be able to do anything about the issue itself, but I'm wondering if there is a way I could use exception groups in Python 3.11+ to give a better error message without slowing things down and making it messy trying to figure out which Python version is being used.

Really, I'm not sure whether exception groups could give a better message or not; I haven't used them yet, since all my libraries support 3.8+. I figure it's at least worth looking into, though.
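As a rough sketch of the idea, the failures from concurrently sent batches could be collected and raised together as an ExceptionGroup on Python 3.11+, falling back to the first error on older versions. This is only an illustration under stated assumptions, not the client's code. `send_batch` and `send_all` are hypothetical names, and the failure condition is invented for the demo.

```python
import asyncio
import sys


async def send_batch(batch_id: int) -> int:
    """Hypothetical stand-in for one batch upload; odd ids fail for the demo."""
    await asyncio.sleep(0)
    if batch_id % 2:
        raise OSError(f"batch {batch_id}: No buffer space available")
    return batch_id


async def send_all(batch_ids):
    """Send every batch, then report all failures together instead of only one."""
    results = await asyncio.gather(
        *(send_batch(i) for i in batch_ids), return_exceptions=True
    )
    errors = [r for r in results if isinstance(r, Exception)]
    if not errors:
        return results
    if sys.version_info >= (3, 11):
        # ExceptionGroup is a builtin from Python 3.11 onwards.
        raise ExceptionGroup(f"{len(errors)} of {len(results)} batches failed", errors)
    # Older interpreters: surface the first failure only.
    raise errors[0]
```

The single version check lives in one place, and callers on 3.11+ can handle the grouped errors with `except*` while older versions behave as before.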