milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.29k stars 2.91k forks source link

[Bug]: flush all failed after milvus already gets back from many chaos test #33540

Closed zhuwenxing closed 4 months ago

zhuwenxing commented 5 months ago

Is there an existing issue for this?

Environment

- Milvus version:master-20240531-2c7bb0b8-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): all   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.255.102.75', 'port': 19530} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_response) : True  (api_request.py:37)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['Checker__bEn31jmW', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-06-02T23:26:00.847Z] -------------

[2024-06-02T23:26:00.847Z] <name>: Checker__bEn31jmW

[2024-06-02T23:26:00.847Z] <description>: 

[2024-06-02T23:26:00.847Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}......  (api_request.py:37)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:19:33 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:105)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:19:33 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Checker__bEn31jmW, flusht_ts: 450198291857539073)>, <Time:{'RPC start': '2024-06-02 23:16:32.904152', 'RPC error': '2024-06-02 23:19:33.021434'}> (decorators.py:139)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:19:33 - ERROR - ci_test]: Traceback (most recent call last):

[2024-06-02T23:26:00.847Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-06-02T23:26:00.847Z]     res = func(*args, **_kwargs)

[2024-06-02T23:26:00.847Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-06-02T23:26:00.847Z]     return func(*arg, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush

[2024-06-02T23:26:00.848Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 140, in handler

[2024-06-02T23:26:00.848Z]     raise e from e

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2024-06-02T23:26:00.848Z]     return func(*args, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 175, in handler

[2024-06-02T23:26:00.848Z]     return func(self, *args, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 106, in handler

[2024-06-02T23:26:00.848Z]     raise MilvusException(

[2024-06-02T23:26:00.848Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Checker__bEn31jmW, flusht_ts: 450198291857539073)>

[2024-06-02T23:26:00.848Z]  (api_request.py:45)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/14576/pipeline log: artifacts-etcd-followers-pod-failure-14576-server-logs.tar.gz


[2024-06-02T23:14:37.219Z] + kubectl get pods -o wide

[2024-06-02T23:14:37.221Z] + grep etcd-followers-pod-failure-14576

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-0                                1/1     Running       2 (17m ago)        44m     10.104.17.97    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-1                                1/1     Running       0                  44m     10.104.27.34    4am-node31   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-2                                1/1     Running       0                  44m     10.104.25.223   4am-node30   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-0                          2/2     Running       2 (43m ago)        44m     10.104.17.88    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-1                          2/2     Running       2 (43m ago)        44m     10.104.30.91    4am-node38   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-2                          2/2     Running       2 (43m ago)        44m     10.104.19.101   4am-node28   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-exporter-86cb4f8b9bkb5sz   1/1     Running       5 (42m ago)        44m     10.104.4.33     4am-node11   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-datacoord-58cd8d5fwjxcz   1/1     Running       4 (42m ago)        44m     10.104.14.77    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-datanode-c57d5b4c-chk52   1/1     Running       3 (43m ago)        44m     10.104.1.46     4am-node10   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-datanode-c57d5b4c-xct5m   1/1     Running       4 (42m ago)        44m     10.104.14.76    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexcoord-6d86598mltjj   1/1     Running       0                  44m     10.104.14.75    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexnode-847c4b9bj5sw2   1/1     Running       1 (42m ago)        44m     10.104.13.104   4am-node16   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexnode-847c4b9bn69qw   1/1     Running       4 (42m ago)        44m     10.104.20.80    4am-node22   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexnode-847c4b9bpz7lj   1/1     Running       4 (43m ago)        44m     10.104.18.248   4am-node25   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-proxy-74fdd457f6-8dpff    1/1     Running       4 (43m ago)        44m     10.104.17.66    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querycoord-68b8975vskfx   1/1     Running       4 (42m ago)        44m     10.104.14.74    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querynode-57c449456f9b4   1/1     Running       4 (42m ago)        44m     10.104.23.7     4am-node27   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querynode-57c44945f72hc   1/1     Running       4 (42m ago)        44m     10.104.34.118   4am-node37   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querynode-57c44945s28nh   1/1     Running       4 (42m ago)        44m     10.104.15.167   4am-node20   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-rootcoord-7466f8c4rj92v   1/1     Running       4 (42m ago)        44m     10.104.26.125   4am-node32   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-0                          1/1     Running       0                  44m     10.104.27.30    4am-node31   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-1                          1/1     Running       0                  44m     10.104.17.98    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-2                          1/1     Running       0                  44m     10.104.26.131   4am-node32   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-3                          1/1     Running       0                  44m     10.104.19.102   4am-node28   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-zookeeper-0                      1/1     Running       0                  44m     10.104.27.32    4am-node31   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-zookeeper-1                      1/1     Running       0                  44m     10.104.17.99    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-zookeeper-2                      1/1     Running       0                  44m     10.104.26.132   4am-node32   <none>           <none>

Anything else?

almost all chaos. test for master image all failed due to flush timeout image

zhuwenxing commented 5 months ago

/assign @XuanYang-cn PTAL

XuanYang-cn commented 5 months ago

/assign @xiaocai2333

XuanYang-cn commented 5 months ago

image

XuanYang-cn commented 5 months ago

Flush failed for L0 segments added back into metacache, those L0 segment will never get flushed and block flush forever

xiaocai2333 commented 5 months ago

fixed, please verify. /assign @zhuwenxing

XuanYang-cn commented 5 months ago

/unassign

zhuwenxing commented 4 months ago

it is still reproduced in master-20240612-9d4535ce-amd64

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/14838/pipeline log: artifacts-querynode-pod-kill-14838-server-logs.tar.gz

This is one of many failed tests.

zhuwenxing commented 4 months ago

/unassign

xiaocai2333 commented 4 months ago

/assign @zhuwenxing please verify again.

zhuwenxing commented 4 months ago

image tag: master-20240619-7b9462c0-amd64

The current situation has seen some improvement, but the issue has not been fully resolved. Previously, almost all collections would fail to flush, and now some collections still fail to flush.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/15373/pipeline log: artifacts-proxy-pod-failure-15373-server-logs.tar.gz

[2024-06-19T19:52:25.998Z] [2024-06-19 19:49:21 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-06-19T19:52:25.998Z] -------------

[2024-06-19T19:52:25.998Z] <name>: InsertChecker__ia1cSkY2

[2024-06-19T19:52:25.999Z] <description>: 

[2024-06-19T19:52:25.999Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT......  (api_request.py:37)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:49:21 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:105)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__ia1cSkY2, flusht_ts: 450580070212829196)>, <Time:{'RPC start': '2024-06-19 19:49:21.823172', 'RPC error': '2024-06-19 19:52:21.943708'}> (decorators.py:139)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - ERROR - ci_test]: Traceback (most recent call last):

[2024-06-19T19:52:25.999Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-06-19T19:52:25.999Z]     res = func(*args, **_kwargs)

[2024-06-19T19:52:25.999Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-06-19T19:52:25.999Z]     return func(*arg, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush

[2024-06-19T19:52:25.999Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 140, in handler

[2024-06-19T19:52:25.999Z]     raise e from e

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2024-06-19T19:52:25.999Z]     return func(*args, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 175, in handler

[2024-06-19T19:52:25.999Z]     return func(self, *args, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 106, in handler

[2024-06-19T19:52:25.999Z]     raise MilvusException(

[2024-06-19T19:52:25.999Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__ia1cSkY2, flusht_ts: 450580070212829196)>

[2024-06-19T19:52:25.999Z]  (api_request.py:45)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__ia1cSkY2, flusht_ts: 450580070212829196)> (api_request.py:46)

[2024-06-19T19:52:25.999Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2024-06-19T19:52:25.999Z] =========================== short test summary info ============================

[2024-06-19T19:52:25.999Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__CVwOkPk7] - AssertionError: Response of API flush expect True, but got False

[2024-06-19T19:52:25.999Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[InsertChecker__ia1cSkY2] - AssertionError: Response of API flush expect True, but got False
zhuwenxing commented 4 months ago

/unassign

@xiaocai2333 PTAL

xiaocai2333 commented 4 months ago

If the segment has just been created and doesn't have a stats log yet, it won't be include in the SyncSegments view. Fix this immediately.

zhuwenxing commented 4 months ago

Not reproduced in master-20240625-506a9152-amd64