milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [one pod standalone]when Milvus recovers from pod kill chaos, most of its interfaces are not available #30314

Open zhuwenxing opened 10 months ago

zhuwenxing commented 10 months ago

### Is there an existing issue for this?

### Environment

- Milvus version:master-20240126-7ced0af1-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

### Current Behavior

1. All search/query requests failed: collection not loaded
2. All flush calls failed:

[2024-01-26T07:50:39.687Z] <name>: Hello_Milvus

[2024-01-26T07:50:39.687Z] <description>:

[2024-01-26T07:50:39.687Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'n...... (api_request.py:37)

[2024-01-26T07:50:39.687Z] [2024-01-26 07:47:32 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-01-26T07:50:39.687Z] [2024-01-26 07:50:32 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_447284044378545664v0])>, <Time:{'RPC start': '2024-01-26 07:47:32.703287', 'RPC error': '2024-01-26 07:50:32.649750'}> (decorators.py:134)

[2024-01-26T07:50:39.687Z] [2024-01-26 07:50:32 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-26T07:50:39.687Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-26T07:50:39.688Z] res = func(*args, **_kwargs)

[2024-01-26T07:50:39.688Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-26T07:50:39.688Z] return func(*arg, **kwargs)

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 314, in flush

[2024-01-26T07:50:39.688Z] conn.flush([self.name], timeout=timeout, **kwargs)

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-26T07:50:39.688Z] raise e from e

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-26T07:50:39.688Z] return func(*args, **kwargs)

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-26T07:50:39.688Z] return func(self, *args, **kwargs)

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-26T07:50:39.688Z] raise e from e

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-26T07:50:39.688Z] return func(*args, **kwargs)

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1335, in flush

[2024-01-26T07:50:39.688Z] check_status(response.status)

[2024-01-26T07:50:39.688Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 58, in check_status

[2024-01-26T07:50:39.688Z] raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-26T07:50:39.688Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_447284044378545664v0])>

[2024-01-26T07:50:39.688Z] (api_request.py:45)

[2024-01-26T07:53:36.970Z] <name>: QueryChecker__q3nvggGS

[2024-01-26T07:53:36.970Z] <description>:

[2024-01-26T07:53:36.970Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT:...... (api_request.py:37)

[2024-01-26T07:53:36.970Z] [2024-01-26 07:44:54 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-01-26T07:53:36.971Z] [2024-01-26 07:47:32 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=65535, message=failed to flush collection 447284044378546654: etcdserver: mvcc: database space exceeded)>, <Time:{'RPC start': '2024-01-26 07:44:54.643498', 'RPC error': '2024-01-26 07:47:32.568409'}> (decorators.py:134)

[2024-01-26T07:53:36.971Z] [2024-01-26 07:47:32 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-26T07:53:36.971Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-26T07:53:36.971Z] res = func(*args, **_kwargs)

[2024-01-26T07:53:36.971Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-26T07:53:36.971Z] return func(*arg, **kwargs)

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 314, in flush

[2024-01-26T07:53:36.971Z] conn.flush([self.name], timeout=timeout, **kwargs)

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-26T07:53:36.971Z] raise e from e

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-26T07:53:36.971Z] return func(*args, **kwargs)

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-26T07:53:36.971Z] return func(self, *args, **kwargs)

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-26T07:53:36.971Z] raise e from e

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-26T07:53:36.971Z] return func(*args, **kwargs)

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1335, in flush

[2024-01-26T07:53:36.971Z] check_status(response.status)

[2024-01-26T07:53:36.971Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 58, in check_status

[2024-01-26T07:53:36.971Z] raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-26T07:53:36.971Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=failed to flush collection 447284044378546654: etcdserver: mvcc: database space exceeded)>

[2024-01-26T07:53:36.971Z] (api_request.py:45)


3. Creating new collections failed:

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - DEBUG - ci_test]: (api_request) : [Collection] args: ['e2e__6Q9S3j7j', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType.VAR......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - ERROR - pymilvus.decorators]: RPC error: [create_collection], <MilvusException: (code=65535, message=etcdserver: mvcc: database space exceeded)>, <Time:{'RPC start': '2024-01-26 07:44:41.083880', 'RPC error': '2024-01-26 07:44:41.086827'}> (decorators.py:134)

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-26T07:44:41.455Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-26T07:44:41.455Z] res = func(*args, **_kwargs)

[2024-01-26T07:44:41.455Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-26T07:44:41.455Z] return func(*arg, **kwargs)

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 147, in __init__

[2024-01-26T07:44:41.455Z] conn.create_collection(self._name, schema, **kwargs)

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-26T07:44:41.455Z] raise e from e

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-26T07:44:41.455Z] return func(*args, **kwargs)

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-26T07:44:41.455Z] return func(self, *args, **kwargs)

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-26T07:44:41.455Z] raise e from e

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-26T07:44:41.455Z] return func(*args, **kwargs)

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 304, in create_collection

[2024-01-26T07:44:41.455Z] check_status(status)

[2024-01-26T07:44:41.455Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 58, in check_status

[2024-01-26T07:44:41.455Z] raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-26T07:44:41.455Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=etcdserver: mvcc: database space exceeded)>

[2024-01-26T07:44:41.455Z] (api_request.py:45)

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - ERROR - ci_test]: (api_response) : <MilvusException: (code=65535, message=etcdserver: mvcc: database space exceeded)> (api_request.py:46)
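The three failures above differ mainly in the RPC name, the error code, and how long the call ran before failing. As a minimal sketch for triaging such logs, the snippet below pulls those fields out of a pymilvus `RPC error` line. The regular expression is an assumption derived only from the lines quoted in this issue, and `parse_rpc_error` is a hypothetical helper, not part of pymilvus or the test framework.

```python
import re
from datetime import datetime

# Pattern inferred from the "RPC error" lines quoted in this issue (assumption).
RPC_ERROR = re.compile(
    r"RPC error: \[(?P<rpc>\w+)\], "
    r"<MilvusException: \(code=(?P<code>\d+), message=(?P<message>.*?)\)>, "
    r"<Time:\{'RPC start': '(?P<start>[^']+)', 'RPC error': '(?P<end>[^']+)'\}>"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_rpc_error(line):
    """Return (rpc, code, message, elapsed_seconds), or None if the line does not match."""
    match = RPC_ERROR.search(line)
    if match is None:
        return None
    start = datetime.strptime(match["start"], TIME_FORMAT)
    end = datetime.strptime(match["end"], TIME_FORMAT)
    elapsed = (end - start).total_seconds()
    return match["rpc"], int(match["code"]), match["message"], elapsed
```

Applied to the lines above, the first flush call (timeout 180) ran for about 179.9 s before returning "channel not found", so it used almost the whole timeout, while create_collection was rejected by etcd after about 3 ms.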



### Expected Behavior

_No response_

### Steps To Reproduce

_No response_

### Milvus Log

failed job:https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/10981/pipeline

log:
[artifacts-one-pod-standalone-pod-kill-10981-server-logs.tar.gz](https://github.com/milvus-io/milvus/files/14061831/artifacts-one-pod-standalone-pod-kill-10981-server-logs.tar.gz)

### Anything else?

_No response_

zhuwenxing commented 10 months ago

/assign @LoveEachDay

PTAL

xiaofan-luan commented 8 months ago

The "database space exceeded" error suggests that etcd is failing.

zhuwenxing commented 8 months ago

The error message has changed in master-20240305-3c9ffded.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/12157/pipeline

log: artifacts-one-pod-standalone-pod-kill-12157-server-logs.tar.gz


[2024-03-05T17:16:34.268Z] <name>: Hello_Milvus

[2024-03-05T17:16:34.268Z] <description>: 

[2024-03-05T17:16:34.268Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'n......  (api_request.py:37)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:13:27 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:16:25 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0])>, <Time:{'RPC start': '2024-03-05 17:13:27.030313', 'RPC error': '2024-03-05 17:16:25.634781'}> (decorators.py:134)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:16:25 - ERROR - ci_test]: Traceback (most recent call last):

[2024-03-05T17:16:34.268Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-03-05T17:16:34.268Z]     res = func(*args, **_kwargs)

[2024-03-05T17:16:34.268Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-03-05T17:16:34.268Z]     return func(*arg, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 314, in flush

[2024-03-05T17:16:34.268Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-03-05T17:16:34.268Z]     raise e from e

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-03-05T17:16:34.268Z]     return func(*args, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-03-05T17:16:34.268Z]     return func(self, *args, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-03-05T17:16:34.268Z]     raise e from e

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-03-05T17:16:34.268Z]     return func(*args, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1396, in flush

[2024-03-05T17:16:34.268Z]     check_status(response.status)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 60, in check_status

[2024-03-05T17:16:34.268Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-03-05T17:16:34.268Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0])>

[2024-03-05T17:16:34.268Z]  (api_request.py:45)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:16:25 - ERROR - ci_test]: (api_response) : <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0])> (api_request.py:46)

LoveEachDay commented 8 months ago

> The "database space exceeded" error suggests that etcd is failing.

We should change the default etcd settings for embedded mode. @pingliu Please add the following config to embedEtcd.yaml:

quota-backend-bytes: '4294967296'
auto-compaction-mode: 'revision'
auto-compaction-retention: '1000'
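
To make the three values easier to read, here is a small sketch that converts the quota to GiB and spells out what the compaction settings ask for. The dictionary simply mirrors the fragment above. The wording for each mode reflects my reading of etcd's documented `auto-compaction-mode` options and should be checked against the etcd version that Milvus embeds; `quota_gib` and `compaction_summary` are hypothetical helpers.

```python
# Values copied from the comment above (strings, as in the YAML fragment).
embed_etcd_settings = {
    "quota-backend-bytes": "4294967296",
    "auto-compaction-mode": "revision",
    "auto-compaction-retention": "1000",
}


def quota_gib(settings):
    """Convert the backend quota from bytes to GiB."""
    return int(settings["quota-backend-bytes"]) / 1024**3


def compaction_summary(settings):
    """Describe what auto-compaction retains for the two etcd modes."""
    mode = settings["auto-compaction-mode"]
    retention = settings["auto-compaction-retention"]
    if mode == "revision":
        return f"keep the latest {retention} revisions"
    if mode == "periodic":
        return f"keep a time window of {retention}"
    raise ValueError(f"unknown auto-compaction-mode: {mode!r}")


print(f"backend quota: {quota_gib(embed_etcd_settings):g} GiB")
print(f"compaction: {compaction_summary(embed_etcd_settings)}")
```

With these values the quota is 4 GiB, and revision-based compaction retains roughly the latest 1000 revisions, which limits how much key history accumulates toward the quota.
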
zhuwenxing commented 8 months ago

Still reproduced; see https://github.com/milvus-io/milvus/issues/30545#issuecomment-2024296262

LoveEachDay commented 7 months ago

@LoveEachDay We'd change the auto-compaction config for embedded etcd, which should mitigate the "mvcc: database space exceeded" problem.

zhuwenxing commented 7 months ago

The "mvcc: database space exceeded" problem no longer reproduces after https://github.com/milvus-io/milvus/pull/32048, but the "channel not found" problem still does.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/13356/pipeline

log: artifacts-one-pod-standalone-pod-kill-13356-server-logs.tar.gz

xiaofan-luan commented 7 months ago

/assign @weiliu1031

XuanYang-cn commented 6 months ago

/assign

zhuwenxing commented 5 months ago

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/15592/pipeline

log: artifacts-one-pod-standalone-pod-kill-15592-server-logs.tar.gz

Currently, all errors are due to flush failures.
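
One way to tell a slow recovery from a permanent failure is to keep retrying the failing call for a bounded time after the pod comes back. The sketch below is an assumption about how such a probe could look, not the logic of the chaos test itself; `wait_until_ok` is a hypothetical helper, and the operation is passed in as a callable so the snippet runs without a Milvus instance.

```python
import time


def wait_until_ok(op, deadline_s=300.0, interval_s=10.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call op() until it stops raising or the deadline would be exceeded.

    Returns (succeeded, attempts, last_error).
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            op()
            return True, attempts, None
        except Exception as exc:  # keep the last error for the report
            last_error = exc
        # Stop if another wait would push us past the deadline.
        if clock() - start + interval_s > deadline_s:
            return False, attempts, last_error
        sleep(interval_s)
```

In a real check, `op` could be something like `lambda: collection.flush(timeout=180)` for a pymilvus `Collection`, matching the call in the tracebacks above. If the probe still fails at the deadline with "channel not found", the failure is unlikely to be a matter of recovery time alone.
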

zhuwenxing commented 2 months ago

Still reproduces consistently.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/18644/pipeline

log: artifacts-one-pod-standalone-pod-kill-18644-server-logs.tar.gz

@XuanYang-cn