milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
29.35k stars 2.82k forks

[Bug]: [rolling upgrade] Search and query still fail (failures persist for about 2 minutes) after injecting rootcoord pod-failure chaos when standby is activated #26572

Closed zhuwenxing closed 9 months ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version:master-20230823-b9bc6681
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2023-08-23T07:46:54.948Z] chaos_commons.py:112: AssumptionFailure

[2023-08-23T07:46:54.948Z] >>   pytest.assume(

[2023-08-23T07:46:54.948Z] AssertionError: Expect Succ: Op.search succ rate 0.9482758620689655, total: 116, average time: 0.9191

[2023-08-23T07:46:54.948Z] assert False

[2023-08-23T07:46:54.948Z] 

[2023-08-23T07:46:54.948Z] chaos_commons.py:112: AssumptionFailure

[2023-08-23T07:46:54.948Z] >>   pytest.assume(

[2023-08-23T07:46:54.948Z] AssertionError: Expect Succ: Op.query succ rate 0.9482758620689655, total: 116, average time: 0.9249
[2023-08-23T07:46:54.947Z] [2023-08-23 07:46:54 - INFO - ci_test]: QueryChecker failed at [('failure', 101, '2023-08-23T07:44:56.942516Z', 1692776696.9425774), ('failure', 102, '2023-08-23T07:45:07.947312Z', 1692776707.9473467), ('success', 103, '2023-08-23T07:45:18.950397Z', 1692776718.950448), ('failure', 104, '2023-08-23T07:45:29.367998Z', 1692776729.3680367), ('failure', 105, '2023-08-23T07:45:40.371369Z', 1692776740.3714037), ('failure', 106, '2023-08-23T07:45:51.374017Z', 1692776751.3740554), ('success', 107, '2023-08-23T07:46:02.377551Z', 1692776762.377591), ('failure', 110, '2023-08-23T07:46:23.410550Z', 1692776783.4105754), ('success', 111, '2023-08-23T07:46:34.414008Z', 1692776794.4140503)] (checker.py:156)
[2023-08-23T07:46:54.946Z] [2023-08-23 07:46:54 - INFO - ci_test]: SearchChecker failed at [('failure', 96, '2023-08-23T07:44:31.680961Z', 1692776671.6810055), ('success', 97, '2023-08-23T07:44:42.687978Z', 1692776682.6880438), ('failure', 101, '2023-08-23T07:44:56.944012Z', 1692776696.9440417), ('failure', 102, '2023-08-23T07:45:07.947955Z', 1692776707.9479766), ('success', 103, '2023-08-23T07:45:18.952446Z', 1692776718.9524746), ('failure', 104, '2023-08-23T07:45:27.384912Z', 1692776727.384987), ('success', 105, '2023-08-23T07:45:38.390998Z', 1692776738.3910494), ('failure', 107, '2023-08-23T07:45:57.111973Z', 1692776757.1120214), ('success', 108, '2023-08-23T07:46:08.115960Z', 1692776768.1159952), ('failure', 110, '2023-08-23T07:46:23.409980Z', 1692776783.4100382), ('success', 111, '2023-08-23T07:46:34.415965Z', 1692776794.4159973)] (checker.py:156)

Instead of failing continuously, these requests alternate between failure and success.
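The alternating pattern is visible in the numbers themselves: 110 successes out of 116 requests is exactly the 0.9482758620689655 rate in the log. A minimal sketch of the checker's success-rate bookkeeping (a hypothetical re-implementation; the real logic lives in checker.py and chaos_commons.py and is not shown here):

```python
def succ_rate(results):
    """Success rate over checker result tuples of the form
    (status, seq_no, iso_timestamp, epoch_timestamp)."""
    if not results:
        return 0.0
    ok = sum(1 for status, *_ in results if status == "success")
    return ok / len(results)

# 110 successes plus the 6 logged query failures reproduce the logged rate:
run = [("success",)] * 110 + [("failure",)] * 6
print(succ_rate(run))  # 0.9482758620689655
```

With an (assumed) pass threshold above this value, pytest.assume records the failure but lets the remaining checks run.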

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/standby_test_master/detail/standby_test_master/92/pipeline log: artifacts-rootcoord-pod-failure-92-pytest-logs.tar.gz artifacts-rootcoord-pod-failure-92-server-logs.tar.gz

Anything else?

  1. find the active rootcoord and the standby rootcoord
    
    [2023-08-23T07:41:44.113Z] [2023-08-23 07:41:43 - INFO - ci_test]: coord_type: rootcoord, etcd session value: (b'{"ServerID":2,"ServerName":"rootcoord","Address":"10.102.10.161:53100","Exclusive":true,"TriggerKill":true,"Version":"2.3.0-pre+dev"}', <pyetcd.client.KVMetadata object at 0x7fe5d440b640>) (util_k8s.py:407)

[2023-08-23T07:41:44.113Z] [2023-08-23 07:41:43 - INFO - root]: activated pod list: ['rootcoord-pod-failure-92-milvus-rootcoord-6f559b78c4-6cgxw'], standby pod list: ['rootcoord-pod-failure-92-milvus-rootcoord-6f559b78c4-mj8rk'] (test_chaos_apply_to_coord.py:64)
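For reference, the active coordinator can be recovered from the etcd session value shown above; a sketch (the field names come from the logged JSON, while the selection logic is an assumption about what util_k8s.py does):

```python
import json

# etcd session value for the ACTIVE rootcoord, copied from the log above.
session_value = (
    b'{"ServerID":2,"ServerName":"rootcoord","Address":"10.102.10.161:53100",'
    b'"Exclusive":true,"TriggerKill":true,"Version":"2.3.0-pre+dev"}'
)

session = json.loads(session_value)
active_address = session["Address"]   # which address holds the active role
server_id = session["ServerID"]
print(active_address, server_id)      # 10.102.10.161:53100 2
```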

2. apply pod-failure chaos to the active rootcoord

[2023-08-23T07:41:44.113Z] [2023-08-23 07:41:43 - INFO - root]: chaos injected (test_chaos_apply_to_coord.py:91)

[2023-08-23T07:41:44.113Z] [2023-08-23 07:41:44 - INFO - root]: chaos inject result: PodChaos, test-rootcoord-standby-1692776503 (test_chaos_apply_to_coord.py:96)


3. rootcoord-pod-failure-92-milvus-rootcoord-6f559b78c4-6cgxw restarted two more times
before the chaos, some pods had already restarted
![image](https://github.com/milvus-io/milvus/assets/12268675/525794d8-ec97-4da9-92d9-6a2c3d2070b8)

after the chaos, only rootcoord had two more restarts; the other pods did not restart
![image](https://github.com/milvus-io/milvus/assets/12268675/9d629475-4292-4cbe-96f9-f349f2bb34c5)

4. the standby rootcoord switched to active almost at the moment the chaos was applied
![image](https://github.com/milvus-io/milvus/assets/12268675/bbf83a0f-45ae-41cb-aa52-232adaa1ddaf)

5. search and query failures started at 2023-08-23T07:44:31.680961Z
zhuwenxing commented 1 year ago

/assign @bigsheeper

zhuwenxing commented 1 year ago

There is also an index failure:

IndexChecker failed at [('failure', 3, '2023-08-23T07:43:48.309858Z', 1692776628.3099241)]
zhuwenxing commented 1 year ago

When applying pod-kill chaos to the active rootcoord, there are no search/query failures, but there is still an index failure.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/standby_test_master/detail/standby_test_master/93/pipeline log: artifacts-rootcoord-pod-kill-93-pytest-logs.tar.gz

artifacts-rootcoord-pod-kill-93-server-logs.tar.gz

pod kill chaos was applied at 2023-08-23T09:03:16.372Z

[2023-08-23T09:03:16.372Z] [2023-08-23 09:03:16 - INFO - root]: activated pod list: ['rootcoord-pod-kill-93-milvus-rootcoord-5dd5576c9b-jqklp'], standby pod list: ['rootcoord-pod-kill-93-milvus-rootcoord-5dd5576c9b-vz2mj'] (test_chaos_apply_to_coord.py:64)

[2023-08-23T09:03:16.372Z] [2023-08-23 09:03:16 - INFO - root]: chaos_config: {'apiVersion': 'chaos-mesh.org/v1alpha1', 'kind': 'PodChaos', 'metadata': {'name': 'test-rootcoord-standby-1692781396', 'namespace': 'chaos-testing'}, 'spec': {'selector': {'pods': {'chaos-testing': ['rootcoord-pod-kill-93-milvus-rootcoord-5dd5576c9b-jqklp']}}, 'mode': 'all', 'action': 'pod-kill', 'duration': '2m', 'gracePeriod': 0}} (test_chaos_apply_to_coord.py:84)

[2023-08-23T09:03:16.372Z] [2023-08-23 09:03:16 - INFO - ci_test]: create custom resource response: {'apiVersion': 'chaos-mesh.org/v1alpha1', 'kind': 'PodChaos', 'metadata': {'creationTimestamp': '2023-08-23T09:03:16Z', 'generation': 1, 'managedFields': [{'apiVersion': 'chaos-mesh.org/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:spec': {'.': {}, 'f:action': {}, 'f:duration': {}, 'f:gracePeriod': {}, 'f:mode': {}, 'f:selector': {'.': {}, 'f:pods': {'.': {}, 'f:chaos-testing': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2023-08-23T09:03:16Z'}], 'name': 'test-rootcoord-standby-1692781396', 'namespace': 'chaos-testing', 'resourceVersion': '360069002', 'uid': 'e3e5631d-715e-4239-b927-c0321d72177d'}, 'spec': {'action': 'pod-kill', 'duration': '2m', 'mode': 'all', 'selector': {'namespaces': ['chaos-testing'], 'pods': {'chaos-testing': ['rootcoord-pod-kill-93-milvus-rootcoord-5dd5576c9b-jqklp']}}}, 'status': {'experiment': {}}} (cus_resource_opts.py:40)

[2023-08-23T09:03:16.372Z] [2023-08-23 09:03:16 - INFO - root]: chaos injected (test_chaos_apply_to_coord.py:91)
[2023-08-23T09:04:43.111Z] [2023-08-23 09:04:42 - INFO - ci_test]: IndexChecker failed at [('failure', 7, '2023-08-23T09:02:22.099983Z', 1692781342.1000383)] (checker.py:156)
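The chaos_config dict logged above corresponds to a Chaos Mesh PodChaos manifest along these lines (reconstructed from the dict for readability; not a tested file):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: test-rootcoord-standby-1692781396
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: all
  duration: 2m
  gracePeriod: 0
  selector:
    pods:
      chaos-testing:
        - rootcoord-pod-kill-93-milvus-rootcoord-5dd5576c9b-jqklp
```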
yanliang567 commented 1 year ago

/unassign

bigsheeper commented 1 year ago


create index timeout:

indexNode is executing other index tasks, please allocate more resources to indexNode or start more indexNodes.

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22C%22,%22expr%22:%22%7Bpod%3D~%5C%22rootcoord-pod-kill-93-milvus-datacoord-559c498597-h8bwm%5C%22%7D%20!~%20%5C%22node%20is%20not%20matched%20with%20channel%5C%22%20%7C~%20%5C%22WARN%7CERROR%7Cetcd%7C2379%7Cindex%20builder%20peek%20client%20error,%20there%20is%20no%20available%5C%22%22,%22hide%22:false%7D%5D,%22range%22:%7B%22from%22:%221692781335000%22,%22to%22:%221692781463000%22%7D%7D

bigsheeper commented 1 year ago


ServerID mismatch, successfully reset the connection:

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bpod%3D~%5C%22rootcoord-pod-failure-92-milvus-.*%5C%22%7D%20%7C%3D%20%5C%22grpcclient%2Fclient.go%5C%22%22,%22hide%22:false%7D,%7B%22refId%22:%22B%22,%22expr%22:%22%7Bpod%3D~%5C%22rootcoord-pod-failure-92-milvus-rootcoord-6f559b78c4-6cgxw%5C%22%7D%20%7C~%20%5C%22quit%20STANDBY%20mode,%20this%20node%20will%20become%20ACTIVE%7Center%20standby%20mode%20successfully%7Cregister%20ACTIVE%7CGitCommit%5C%22%20%22,%22hide%22:true%7D,%7B%22refId%22:%22C%22,%22expr%22:%22%7Bpod%3D~%5C%22rootcoord-pod-failure-92-milvus-rootcoord-6f559b78c4-mj8rk%5C%22%7D%20%7C~%20%5C%22quit%20STANDBY%20mode,%20this%20node%20will%20become%20ACTIVE%7Center%20standby%20mode%20successfully%7Cregister%20ACTIVE%5C%22%20%22,%22hide%22:true%7D%5D,%22range%22:%7B%22from%22:%221692776503000%22,%22to%22:%221692776513000%22%7D%7D

bigsheeper commented 1 year ago


For the search/query failures: rootcoord produced tt (time tick) messages too slowly, which made searches/queries slow and even caused timeouts.

I guess the reason rootcoord produced tt msgs too slowly is that when RootCoord transitions from the standby state to the active state, it needs to register the Pulsar producer, which may cause some fluctuation in Pulsar's performance. Employing bounded consistency for queries, or increasing the query timeout, can alleviate this issue.
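As a client-side stopgap in the spirit of this suggestion (a sketch only; the wrapper and its parameters are hypothetical, not Milvus API), a search call could be retried with backoff while the switchover completes:

```python
import time

def search_with_retry(search_fn, attempts=3, backoff=1.0):
    """Retry a flaky call with exponential backoff.

    `search_fn` stands in for a search invocation whose deadline may be
    exceeded while the standby rootcoord is taking over; TimeoutError
    here models a gRPC DEADLINE_EXCEEDED failure.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return search_fn()
        except TimeoutError as err:
            last_err = err
            time.sleep(backoff * (2 ** attempt))  # wait before retrying
    raise last_err
```

A real caller would wrap e.g. a pymilvus search call and translate the gRPC DEADLINE_EXCEEDED error into the retried exception.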

bigsheeper commented 1 year ago

/assign @zhuwenxing

zhuwenxing commented 1 year ago

@bigsheeper Any solution for this? I don't think this problem should be circumvented by having the user adjust the timeout; the server side should address this performance fluctuation. Our expectation is that there will be no, or only small, performance fluctuations during an active-standby switchover.

/assign @bigsheeper

bigsheeper commented 1 year ago


Two quick questions:

  1. Does this fluctuation occur with Kafka?
  2. Does this fluctuation occur on every active-standby switch? And is its duration the same on each occurrence?
zhuwenxing commented 1 year ago

There are also some failed requests when the MQ is Kafka. failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/standby_test_master/detail/standby_test_master/152/pipeline

log:

artifacts-rootcoord-pod-failure-152-server-logs.tar.gz

artifacts-rootcoord-pod-failure-152-pytest-logs.tar.gz

chaos injection and deletion times

[2023-08-29T09:19:47.580Z] [2023-08-29 09:19:47 - INFO - root]: chaos injected (test_chaos_apply_to_coord.py:91)

[2023-08-29T09:19:47.580Z] [2023-08-29 09:19:47 - INFO - root]: chaos inject result: PodChaos, test-rootcoord-standby-1693300787 (test_chaos_apply_to_coord.py:96)

[2023-08-29T09:20:55.263Z] test-rootcoord-standby-1693300787

[2023-08-29T09:20:55.265Z] [2023-08-29 09:20:47 - INFO - ci_test]: delete custom resource response: {'apiVersion': 'chaos-mesh.org/v1alpha1', 'kind': 'PodChaos', 'metadata': {'creationTimestamp': '2023-08-29T09:19:47Z', 'deletionGracePeriodSeconds': 0, 'deletionTimestamp': '2023-08-29T09:20:47Z', 'finalizers': ['chaos-mesh/records'], 'generation': 6, 'managedFields': [{'apiVersion': 'chaos-mesh.org/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:spec': {'.': {}, 'f:action': {}, 'f:duration': {}, 'f:gracePeriod': {}, 'f:mode': {}, 'f:selector': {'.': {}, 'f:pods': {'.': {}, 'f:chaos-testing': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2023-08-29T09:19:47Z'}, {'apiVersion': 'chaos-mesh.org/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {}, 'v:"chaos-mesh/records"': {}}}, 'f:status': {'f:conditions': {}, 'f:experiment': {'f:containerRecords': {}, 'f:desiredPhase': {}}}}, 'manager': 'chaos-controller-manager', 'operation': 'Update', 'time': '2023-08-29T09:19:47Z'}], 'name': 'test-rootcoord-standby-1693300787', 'namespace': 'chaos-testing', 'resourceVersion': '365685862', 'uid': '8c5b889a-fdd0-4313-a08c-8c68c27ab2f2'}, 'spec': {'action': 'pod-failure', 'duration': '2m', 'mode': 'all', 'selector': {'namespaces': ['chaos-testing'], 'pods': {'chaos-testing': ['rootcoord-pod-failure-152-milvus-rootcoord-794dd696cf-ll7gg']}}}, 'status': {'conditions': [{'status': 'False', 'type': 'AllRecovered'}, {'status': 'False', 'type': 'Paused'}, {'status': 'True', 'type': 'Selected'}, {'status': 'True', 'type': 'AllInjected'}], 'experiment': {'containerRecords': [{'id': 'chaos-testing/rootcoord-pod-failure-152-milvus-rootcoord-794dd696cf-ll7gg', 'phase': 'Injected', 'selectorKey': '.'}], 'desiredPhase': 'Run'}}} (cus_resource_opts.py:54)

[2023-08-29T09:20:55.265Z] [2023-08-29 09:20:47 - INFO - root]: chaos deleted (test_chaos_apply_to_coord.py:100)

query failure

[2023-08-29T09:25:16.779Z] [2023-08-29 09:25:16 - INFO - ci_test]: QueryChecker failed at [('failure', 1, '2023-08-29T09:20:22.696640Z', 1693300822.696672), ('success', 2, '2023-08-29T09:20:33.700153Z', 1693300833.7001991), ('failure', 110, '2023-08-29T09:23:11.358053Z', 1693300991.358082), ('failure', 111, '2023-08-29T09:23:22.360962Z', 1693301002.3610048), ('failure', 112, '2023-08-29T09:23:33.363943Z', 1693301013.3639772), ('success', 113, '2023-08-29T09:23:44.366633Z', 1693301024.3666666)] (checker.py:157)

search failure

[2023-08-29T09:25:16.779Z] [2023-08-29 09:25:16 - INFO - ci_test]: SearchChecker failed at [('failure', 1, '2023-08-29T09:20:20.842607Z', 1693300820.842644), ('success', 2, '2023-08-29T09:20:31.846614Z', 1693300831.8466494), ('failure', 110, '2023-08-29T09:23:11.453612Z', 1693300991.453653), ('failure', 111, '2023-08-29T09:23:22.457996Z', 1693301002.4580276), ('failure', 112, '2023-08-29T09:23:33.462606Z', 1693301013.4626374), ('success', 113, '2023-08-29T09:23:44.466620Z', 1693301024.4666767)] (checker.py:157)
[2023-08-29T09:20:30.853Z] [2023-08-29 09:20:30 - ERROR - pymilvus.decorators]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2023-08-29 09:20:20.843719', 'gRPC error': '2023-08-29 09:20:30.845011'}> (decorators.py:137)
[2023-08-29T09:20:30.853Z] [2023-08-29 09:20:30 - ERROR - ci_test]: Traceback (most recent call last):
[2023-08-29T09:20:30.853Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2023-08-29T09:20:30.853Z]     res = func(*args, **_kwargs)
[2023-08-29T09:20:30.853Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2023-08-29T09:20:30.853Z]     return func(*arg, **kwargs)
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 779, in search
[2023-08-29T09:20:30.853Z]     res = conn.search(
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler
[2023-08-29T09:20:30.853Z]     raise e from e
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 123, in handler
[2023-08-29T09:20:30.853Z]     return func(*args, **kwargs)
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 162, in handler
[2023-08-29T09:20:30.853Z]     return func(self, *args, **kwargs)
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 79, in handler
[2023-08-29T09:20:30.853Z]     raise e from e
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 68, in handler
[2023-08-29T09:20:30.853Z]     return func(*args, **kwargs)
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 774, in search
[2023-08-29T09:20:30.853Z]     return self._execute_search_requests(
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 735, in _execute_search_requests
[2023-08-29T09:20:30.853Z]     raise pre_err from pre_err
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 723, in _execute_search_requests
[2023-08-29T09:20:30.853Z]     response = self._stub.Search(request, timeout=timeout)
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 247, in __call__
[2023-08-29T09:20:30.853Z]     response, ignored_call = self._with_call(request,
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 290, in _with_call
[2023-08-29T09:20:30.853Z]     return call.result(), call
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 379, in result
[2023-08-29T09:20:30.853Z]     raise self
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 274, in continuation
[2023-08-29T09:20:30.853Z]     response, call = self._thunk(new_method).with_call(
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 301, in with_call
[2023-08-29T09:20:30.853Z]     return self._with_call(request,
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 290, in _with_call
[2023-08-29T09:20:30.853Z]     return call.result(), call
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 379, in result
[2023-08-29T09:20:30.853Z]     raise self
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 274, in continuation
[2023-08-29T09:20:30.853Z]     response, call = self._thunk(new_method).with_call(
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 1043, in with_call
[2023-08-29T09:20:30.853Z]     return _end_unary_response_blocking(state, call, True, None)
[2023-08-29T09:20:30.853Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
[2023-08-29T09:20:30.853Z]     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
[2023-08-29T09:20:30.853Z] grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
[2023-08-29T09:20:30.853Z]  status = StatusCode.DEADLINE_EXCEEDED
[2023-08-29T09:20:30.853Z]  details = "Deadline Exceeded"
[2023-08-29T09:20:30.853Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-08-29T09:20:30.844609218+00:00", grpc_status:4}"
[2023-08-29T09:20:30.853Z] >
[2023-08-29T09:20:30.853Z]  (api_request.py:45)
[2023-08-29T09:20:30.853Z] [2023-08-29 09:20:30 - ERROR - ci_test]: (api_response) : <_InactiveRpcError of RPC that terminated with:
[2023-08-29T09:20:30.853Z]  status = StatusCode.DEADLINE_EXCEEDED
[2023-08-29T09:20:30.853Z]  details = "Deadline Exceeded"
[2023-08-29T09:20:30.853Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-08-29T09:20:30.844609218+00:00", grpc_status:4}"
[2023-08-29T09:20:30.853Z] > (api_request.py:46)
[2023-08-29T09:20:30.853Z] [2023-08-29 09:20:30 - INFO - ci_test]: [test][start time:2023-08-29T09:20:20.842607Z][time cost:10.00319088s][operation_name:search][collection name:SearchChecker__xphQesHy] -> False (checker.py:59)
[2023-08-29T09:20:19.971Z] [2023-08-29 09:20:19 - ERROR - pymilvus.decorators]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2023-08-29 09:20:09.839754', 'gRPC error': '2023-08-29 09:20:19.841002'}> (decorators.py:137)
[2023-08-29T09:20:19.971Z] [2023-08-29 09:20:19 - ERROR - ci_test]: Traceback (most recent call last):
[2023-08-29T09:20:19.971Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2023-08-29T09:20:19.971Z]     res = func(*args, **_kwargs)
[2023-08-29T09:20:19.971Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2023-08-29T09:20:19.971Z]     return func(*arg, **kwargs)
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 779, in search
[2023-08-29T09:20:19.971Z]     res = conn.search(
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler
[2023-08-29T09:20:19.971Z]     raise e from e
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 123, in handler
[2023-08-29T09:20:19.971Z]     return func(*args, **kwargs)
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 162, in handler
[2023-08-29T09:20:19.971Z]     return func(self, *args, **kwargs)
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 79, in handler
[2023-08-29T09:20:19.971Z]     raise e from e
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 68, in handler
[2023-08-29T09:20:19.971Z]     return func(*args, **kwargs)
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 774, in search
[2023-08-29T09:20:19.971Z]     return self._execute_search_requests(
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 735, in _execute_search_requests
[2023-08-29T09:20:19.971Z]     raise pre_err from pre_err
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 723, in _execute_search_requests
[2023-08-29T09:20:19.971Z]     response = self._stub.Search(request, timeout=timeout)
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 247, in __call__
[2023-08-29T09:20:19.971Z]     response, ignored_call = self._with_call(request,
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 290, in _with_call
[2023-08-29T09:20:19.971Z]     return call.result(), call
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 379, in result
[2023-08-29T09:20:19.971Z]     raise self
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 274, in continuation
[2023-08-29T09:20:19.971Z]     response, call = self._thunk(new_method).with_call(
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 301, in with_call
[2023-08-29T09:20:19.971Z]     return self._with_call(request,
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 290, in _with_call
[2023-08-29T09:20:19.971Z]     return call.result(), call
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 379, in result
[2023-08-29T09:20:19.971Z]     raise self
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_interceptor.py", line 274, in continuation
[2023-08-29T09:20:19.971Z]     response, call = self._thunk(new_method).with_call(
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 1043, in with_call
[2023-08-29T09:20:19.971Z]     return _end_unary_response_blocking(state, call, True, None)
[2023-08-29T09:20:19.971Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
[2023-08-29T09:20:19.971Z]     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
[2023-08-29T09:20:19.971Z] grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
[2023-08-29T09:20:19.971Z]  status = StatusCode.DEADLINE_EXCEEDED
[2023-08-29T09:20:19.971Z]  details = "Deadline Exceeded"
[2023-08-29T09:20:19.971Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-08-29T09:20:19.840610893+00:00", grpc_status:4}"
[2023-08-29T09:20:19.971Z] >
[2023-08-29T09:20:19.971Z]  (api_request.py:45)
[2023-08-29T09:20:19.971Z] [2023-08-29 09:20:19 - ERROR - ci_test]: (api_response) : <_InactiveRpcError of RPC that terminated with:
[2023-08-29T09:20:19.971Z]  status = StatusCode.DEADLINE_EXCEEDED
[2023-08-29T09:20:19.971Z]  details = "Deadline Exceeded"
[2023-08-29T09:20:19.971Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-08-29T09:20:19.840610893+00:00", grpc_status:4}"
[2023-08-29T09:20:19.971Z] > (api_request.py:46)
[2023-08-29T09:20:19.971Z] [2023-08-29 09:20:19 - INFO - ci_test]: [test][start time:2023-08-29T09:20:09.838616Z][time cost:10.00360816s][operation_name:search][collection name:SearchChecker__xphQesHy] -> False (checker.py:59)
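The repeated `DEADLINE_EXCEEDED` failures above are raw gRPC timeouts surfaced through pymilvus while the rootcoord switchover is in progress. One client-side mitigation during such a window is to retry the call with exponential backoff instead of failing the probe outright. Below is a minimal, hypothetical sketch (plain Python, not part of pymilvus or the test harness); `DeadlineExceeded` and `flaky_search` are stand-ins for the real gRPC error and the real search call:

```python
import time
import random

class DeadlineExceeded(Exception):
    """Stand-in for a gRPC DEADLINE_EXCEEDED error (StatusCode 4)."""

def retry_with_backoff(call, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on DeadlineExceeded with exponential backoff plus jitter.

    Re-raises the last error if all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return call()
        except DeadlineExceeded:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            sleep(delay)

# Example: a call that fails twice (as it would while the standby
# coordinator takes over), then succeeds once the new active is serving.
calls = {"n": 0}
def flaky_search():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise DeadlineExceeded("Deadline Exceeded")
    return "ok"

result = retry_with_backoff(flaky_search, sleep=lambda _: None)
```

Whether retries are appropriate here is debatable: the checker in this test intentionally records each timeout as a failure so that the outage window stays visible in the success-rate metrics.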
[2023-08-29T09:20:20.226Z] [2023-08-29 09:20:20 - INFO - ci_test]: [test][start time:2023-08-29T09:20:20.133618Z][time cost:0.00578582s][operation_name:delete][collection name:DeleteChecker__nunTMRbu] -> True (checker.py:59)
[2023-08-29T09:20:20.479Z] [2023-08-29 09:20:20 - ERROR - pymilvus.decorators]: RPC error: [create_collection], <MilvusException: (code=1, message=context deadline exceeded)>, <Time:{'RPC start': '2023-08-29 09:20:10.369775', 'RPC error': '2023-08-29 09:20:20.384666'}> (decorators.py:126)
[2023-08-29T09:20:20.479Z] [2023-08-29 09:20:20 - ERROR - ci_test]: 'NoneType' object has no attribute 'name' (checker.py:89)
[2023-08-29T09:20:21.400Z] [2023-08-29 09:20:21 - INFO - ci_test]: [test][start time:2023-08-29T09:20:21.139624Z][time cost:0.00700157s][operation_name:delete][collection name:DeleteChecker__nunTMRbu] -> True (checker.py:59)
[2023-08-29T09:20:21.958Z] [2023-08-29 09:20:21 - ERROR - pymilvus.decorators]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2023-08-29 09:20:11.693896', 'gRPC error': '2023-08-29 09:20:21.694998'}> (decorators.py:137)
[2023-08-29T09:20:21.958Z] [2023-08-29 09:20:21 - ERROR - ci_test]: Traceback (most recent call last):
[2023-08-29T09:20:21.958Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2023-08-29T09:20:21.958Z]     res = func(*args, **_kwargs)
[2023-08-29T09:20:21.958Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2023-08-29T09:20:21.958Z]     return func(*arg, **kwargs)
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 913, in query
[2023-08-29T09:20:21.958Z]     return conn.query(
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler
[2023-08-29T09:20:21.958Z]     raise e from e
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 123, in handler
[2023-08-29T09:20:21.958Z]     return func(*args, **kwargs)
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 162, in handler
[2023-08-29T09:20:21.958Z]     return func(self, *args, **kwargs)
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 79, in handler
[2023-08-29T09:20:21.958Z]     raise e from e
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 68, in handler
[2023-08-29T09:20:21.958Z]     return func(*args, **kwargs)
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1346, in query
[2023-08-29T09:20:21.958Z]     response = future.result()
[2023-08-29T09:20:21.958Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 797, in result
[2023-08-29T09:20:21.958Z]     raise self
[2023-08-29T09:20:21.958Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
[2023-08-29T09:20:21.958Z]  status = StatusCode.DEADLINE_EXCEEDED
[2023-08-29T09:20:21.958Z]  details = "Deadline Exceeded"
[2023-08-29T09:20:21.958Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2023-08-29T09:20:21.694605972+00:00"}"
[2023-08-29T09:20:21.958Z] >
[2023-08-29T09:20:21.958Z]  (api_request.py:45)
[2023-08-29T09:20:21.958Z] [2023-08-29 09:20:21 - ERROR - ci_test]: (api_response) : <_MultiThreadedRendezvous of RPC that terminated with:
[2023-08-29T09:20:21.958Z]  status = StatusCode.DEADLINE_EXCEEDED
[2023-08-29T09:20:21.958Z]  details = "Deadline Exceeded"
[2023-08-29T09:20:21.958Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2023-08-29T09:20:21.694605972+00:00"}"
[2023-08-29T09:20:21.958Z] > (api_request.py:46)
[2023-08-29T09:20:21.958Z] [2023-08-29 09:20:21 - INFO - ci_test]: [test][start time:2023-08-29T09:20:11.693640Z][time cost:10.00213264s][operation_name:query][collection name:QueryChecker__TtaUqH9t] -> False (checker.py:59)
zhuwenxing commented 1 year ago

@bigsheeper

Does this fluctuation occur with Kafka?

Yes, see https://github.com/milvus-io/milvus/issues/26572#issuecomment-1697143356

Does this fluctuation trigger on every active-standby switchover? And is its duration the same on each occurrence?

This issue reproduces stably, so I think the answer to both is yes.

zhuwenxing commented 1 year ago

Failures persist for about one minute with image tag 2.3.0-20230911-57bed4a0.


[2023-09-11T08:16:15.999Z] >>   pytest.assume(
[2023-09-11T08:16:15.999Z] AssertionError: Expect Succ: Op.index succ rate 0.7142857142857143, total: 7, average time: 48.2380
[2023-09-11T08:16:15.999Z] assert False
[2023-09-11T08:16:16.000Z] chaos_commons.py:112: AssumptionFailure
[2023-09-11T08:16:16.000Z] >>   pytest.assume(
[2023-09-11T08:16:16.000Z] AssertionError: Expect Succ: Op.search succ rate 0.9781659388646288, total: 229, average time: 0.3623
[2023-09-11T08:16:16.000Z] assert False
[2023-09-11T08:16:16.000Z] chaos_commons.py:112: AssumptionFailure
[2023-09-11T08:16:16.000Z] >>   pytest.assume(
[2023-09-11T08:16:16.000Z] AssertionError: Expect Succ: Op.query succ rate 0.9782608695652174, total: 230, average time: 0.3568
[2023-09-11T08:16:16.000Z] assert False
[2023-09-11T08:16:16.000Z] testcases/test_single_request_operation_for_standby.py:103: AssumptionFailure
[2023-09-11T08:16:16.000Z] >>   pytest.assume(rto < 30,  f"{k} rto expect 30s but get {rto}s")  # rto should be less than 30s
[2023-09-11T08:16:16.000Z] AssertionError: Op.index rto expect 30s but get 73.15737009048462s
[2023-09-11T08:16:16.000Z] assert False
[2023-09-11T08:16:16.000Z] testcases/test_single_request_operation_for_standby.py:103: AssumptionFailure
[2023-09-11T08:16:16.000Z] >>   pytest.assume(rto < 30,  f"{k} rto expect 30s but get {rto}s")  # rto should be less than 30s
[2023-09-11T08:16:16.000Z] AssertionError: Op.search rto expect 30s but get 55.040268659591675s
[2023-09-11T08:16:16.000Z] assert False
[2023-09-11T08:16:16.000Z] testcases/test_single_request_operation_for_standby.py:103: AssumptionFailure
[2023-09-11T08:16:16.000Z] >>   pytest.assume(rto < 30,  f"{k} rto expect 30s but get {rto}s")  # rto should be less than 30s
[2023-09-11T08:16:16.000Z] AssertionError: Op.query rto expect 30s but get 55.03235363960266s
[2023-09-11T08:16:16.000Z] assert False
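The assertions above check two metrics that the harness derives from timestamped checker records (like the `SearchChecker failed at [...]` lists earlier in the thread): the overall success rate and the RTO, taken here as the longest continuous failure window. A hedged sketch of how these could be computed from `(status, id, iso_ts, epoch_ts)` tuples — the tuple layout is inferred from the log excerpts, not taken from the actual harness code:

```python
def succ_rate(records):
    """Fraction of records whose status is 'success'."""
    ok = sum(1 for status, *_ in records if status == "success")
    return ok / len(records)

def max_failure_window(records):
    """Longest run of consecutive failures, measured from the first
    failing probe to the probe that next succeeds (epoch seconds)."""
    worst, start = 0.0, None
    for status, _id, _iso, ts in records:
        if status == "failure":
            if start is None:
                start = ts
        elif start is not None:
            worst = max(worst, ts - start)
            start = None
    return worst

# Illustrative records: probes every 10 s, one 20 s outage and one 10 s outage.
records = [
    ("success", 1, "t1", 100.0),
    ("failure", 2, "t2", 110.0),
    ("failure", 3, "t3", 120.0),
    ("success", 4, "t4", 130.0),  # 20 s window ends here
    ("failure", 5, "t5", 140.0),
    ("success", 6, "t6", 150.0),  # 10 s window ends here
]
rate = succ_rate(records)
rto = max_failure_window(records)
```

With helpers like these, the failing check is simply `pytest.assume(rto < 30)`, and the values reported above (73 s for index, 55 s for search and query) show the active-standby takeover exceeding that budget.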

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/standby_test_release/detail/standby_test_release/2091/pipeline

log: artifacts-rootcoord-pod-failure-2091-pytest-logs.tar.gz

artifacts-rootcoord-pod-failure-2091-server-logs.tar.gz

zhuwenxing commented 1 year ago

/assign @wayblink PTAL

yanliang567 commented 9 months ago

Closing for now; will file a new one if it reproduces.