milvus-io / pymilvus

Python SDK for Milvus.
Apache License 2.0
986 stars 313 forks source link

[Bug]: gRPC handler is reraising Exceptions in a way that breaks low-level Python Exception tracking #2221

Open bcalvert-graft opened 1 month ago

bcalvert-graft commented 1 month ago

Is there an existing issue for this?

Describe the bug

Hello Milvus maintainers,

Thank you in advance for reading my bug-report.

There are three spots in grpc_handler.py (first spot, second spot and third spot) that use the following pattern to reraise an Exception

try:
    <some code>
except Exception as err:
    <do some other stuff>
    raise err from err
...

Respectfully, this raise err from err is not a great way to reraise exceptions; instead, as mentioned in the Python docs it should be

try:
    <some code>
except Exception as err:
    <do some other stuff>
    raise
...

Why is the first one not great? It attaches the err Exception as the __cause__ to itself. As I understand it, this is not the desired semantics for __cause__ and this breaks any tools that recursively unroll chained Exceptions (e.g. for processing stack traces). As a concrete example of such a tool, we actually uncovered this seeming bug in grpc_handler.py when using Milvus as part of a larger Dask computation; specifically, as part of Dask's error-handling, the library uses a utility called collect_causes,

def collect_causes(e: BaseException) -> list[BaseException]:
    causes = []
    while e.__cause__ is not None:
        causes.append(e.__cause__)
        e = e.__cause__
    return causes

When collect_causes encounters an Exception raised from one of the linked spots in grpc_handler.py it triggers an infinite loop of unrolling e.__cause__, consequently leaking memory until the process is killed.

Assuming you agree with the quick fix, I'm happy to submit a PR, but wanted to run it by you first to make sure I'm not missing something.

Cheers, Brian

Expected Behavior

When handling Exceptions raised by the linked methods in grpc_handler.py, the Exceptions should not be attached as their own __cause__

Steps/Code To Reproduce behavior

In one process, run

In [1]: from milvus_lite.server import Server

In [2]: s = Server('test.db', '127.0.0.1:9999')

In [3]: s.start()
Out[3]: True

In [4]: s._p.wait()

In another process, run

In [1]: import pymilvus

In [2]: client = pymilvus.MilvusClient("http://localhost:9999")
   ...: collection_name = "fake_embeddings"
   ...: if client.has_collection(collection_name=collection_name):
   ...:     client.drop_collection(collection_name=collection_name)
   ...: 

In [3]: vector_dim = 512
   ...: client.create_collection(
   ...:     collection_name=collection_name,
   ...:     vector_field_name="vector",
   ...:     dimension=vector_dim,
   ...:     auto_id=True,
   ...:     enable_dynamic_field=True,
   ...:     metric_type="COSINE",
   ...: )

In [4]: import numpy as np

In [5]: vectors = np.random.normal(loc=0, scale=1, size=(vector_dim + 2,))

In [6]: try:
   ...:     client.insert(collection_name, {"vector": vectors})
   ...: except pymilvus.MilvusException as exc:
   ...:     assert exc.__cause__ is not exc, "Exception is its own cause"

The In [6]: step will throw an AssertionError

Environment details

- No hardware/software conditions of note
- Installed pymilvus with pip (version 2.4.4)
- Can reproduce the error with milvus_lite module, so I don't think server config has an impact?

Anything else?

No response

XuanYang-cn commented 1 month ago

Hi @bcalvert-graft,

Thanks for the detailed report! Your insights are valuable. We appreciate you considering a quick fix PR.

You're absolutely right that the current behavior is counterproductive. We originally opted for raise e from e to minimize error stack depth in our catch-retry logic for network fluctuations and rate limits. However, it seems the trade-off is not worth the current issue.

XuanYang-cn commented 1 month ago

/unassign /assign @bcalvert-graft

bcalvert-graft commented 1 month ago

Thanks @XuanYang-cn for the fast reply. I'll open up a PR as soon as I can

bcalvert-graft commented 1 month ago

We originally opted for raise e from e to minimize error stack depth in our catch-retry logic for network fluctuations and rate limits.

Also, thank you for this clarification on the motivation for the raise err from err choice, I hadn't considered that aspect of it

bcalvert-graft commented 1 month ago

@XuanYang-cn and whoever else I should ping, I've opened up #2225 from a fork of the pymilvus repo. The scope of the changes in that PR ended up being larger than the three spots I flagged in the issue description above, as I wrote grep to find instances of the raise <e> from <e> pattern and found more. I had to tweak those as well to get the test-case written above to pass.

bcalvert-graft commented 9 hours ago

Hi all,

Gentle bump on this thread. As mentioned above, I've opened up #2225. Are there additional things I should do to move this bugfix forward?