milvus-io / milvus


[Bug]: Query failed: failed to query: segment lacks[segment=450413621918023414] #33920

Open · ThreadDao opened this issue 3 weeks ago

ThreadDao commented 3 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20240614-5fc1370f-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

1. deploy milvus with image: milvus-io-2.4-eeba851-20240612

2. test steps

  1. create collection fouram_qb77Q7fh -> index
  2. insert 10m-128d entities -> flush
  3. index again and load
  4. concurrent upsert starting from pk-id 0 (screenshot attached); a minimal pymilvus sketch of steps 1-4 follows below
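
For reference, here is a minimal pymilvus sketch of steps 1-4 above. Only the collection name, the 128-d / 10m scale, and the upsert-from-pk-0 pattern come from the report; the connection endpoint, schema field names, index type and parameters, and batch size are assumptions, and the upsert is shown single-threaded for brevity.

    import random
    from pymilvus import (
        connections, Collection, CollectionSchema, FieldSchema, DataType,
    )

    connections.connect(host="127.0.0.1", port="19530")  # endpoint is an assumption

    # 1. create collection fouram_qb77Q7fh with a 128-d float vector field -> index
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
    ]
    c = Collection("fouram_qb77Q7fh", CollectionSchema(fields))
    index_params = {"index_type": "HNSW", "metric_type": "L2",
                    "params": {"M": 8, "efConstruction": 200}}  # assumed index params
    c.create_index("float_vector", index_params)

    # 2. insert 10m entities in batches, then flush
    batch = 50_000
    for start in range(0, 10_000_000, batch):
        ids = list(range(start, start + batch))
        vecs = [[random.random() for _ in range(128)] for _ in ids]
        c.insert([ids, vecs])
    c.flush()

    # 3. index again and load
    c.create_index("float_vector", index_params)
    c.load()

    # 4. upsert starting from pk-id 0 (the test ran this concurrently)
    upsert_ids = list(range(0, batch))
    upsert_vecs = [[random.random() for _ in range(128)] for _ in upsert_ids]
    c.upsert([upsert_ids, upsert_vecs])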

3. upgrade the image to 2.4-20240614-5fc1370f-amd64

4. query failed:

c.query("id >= 0", output_fields=["count(*)"], consistency_level="Strong")
RPC error: [query], <MilvusException: (code=503, message=failed to query: segment lacks[segment=450413621918023414]: channel not available[channel=compact-opt-mem-rootcoord-dml_0_450413621907816811v0])>, <Time:{'RPC start': '2024-06-17 15:46:02.087033', 'RPC error': '2024-06-17 15:46:02.108954'}>
Traceback (most recent call last):
  File "/home/zong/Downloads/pycharm-community-2023.2.5/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 1078, in query
    return conn.query(
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 140, in handler
    raise e from e
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 136, in handler
    return func(*args, **kwargs)
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 175, in handler
    return func(self, *args, **kwargs)
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 115, in handler
    raise e from e
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1487, in query
    check_status(response.status)
  File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/client/utils.py", line 62, in check_status
    raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: segment lacks[segment=450413621918023414]: channel not available[channel=compact-opt-mem-rootcoord-dml_0_450413621907816811v0])>
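
Not from the report, just a hedged diagnostic sketch: when a query fails with "segment lacks / channel not available", one way to narrow it down before a full release-load is to check loading progress and retry. The helper name and retry parameters below are mine, not from the test.

    import time
    from pymilvus import Collection, MilvusException, utility

    def query_with_retry(coll: Collection, expr: str, retries: int = 3, wait_s: float = 10.0):
        """Retry a Strong-consistency count(*) query, printing load progress on failure."""
        for attempt in range(retries):
            try:
                return coll.query(expr, output_fields=["count(*)"], consistency_level="Strong")
            except MilvusException as e:
                print(f"attempt {attempt} failed: {e}")
                # loading_progress reports how much of the collection the query nodes hold
                print("loading progress:", utility.loading_progress(coll.name))
                time.sleep(wait_s)
        raise RuntimeError(f"query still failing after {retries} attempts")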

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/1a09fb89-1185-4d5f-9ee9-3334d95e6c19?nodeId=compact-opt-no-flush-3

Milvus Log

pods:

compact-opt-mem-milvus-datanode-bcdc4ffd9-n86fk                   1/1     Running     0                121m    10.104.13.149   4am-node16   <none>           <none>
compact-opt-mem-milvus-indexnode-5f6c6dc8d5-2xhg9                 1/1     Running     0                6h35m   10.104.25.24    4am-node30   <none>           <none>
compact-opt-mem-milvus-indexnode-5f6c6dc8d5-hw2nl                 1/1     Running     0                6h34m   10.104.17.28    4am-node23   <none>           <none>
compact-opt-mem-milvus-indexnode-5f6c6dc8d5-npzxn                 1/1     Running     0                6h34m   10.104.16.241   4am-node21   <none>           <none>
compact-opt-mem-milvus-mixcoord-7d4d647d65-jgqjk                  1/1     Running     0                6h34m   10.104.14.222   4am-node18   <none>           <none>
compact-opt-mem-milvus-proxy-6d5b87448-85t59                      1/1     Running     0                6h35m   10.104.20.106   4am-node22   <none>           <none>
compact-opt-mem-milvus-querynode-1-54d9466959-csc2m               1/1     Running     0                6h35m   10.104.18.142   4am-node25   <none>           <none>
compact-opt-mem-milvus-querynode-1-54d9466959-m2xwj               1/1     Running     0                6h3m    10.104.30.211   4am-node38   <none>           <none>

Anything else?

No response

yanliang567 commented 3 weeks ago

/assign @congqixia /unassign

ThreadDao commented 3 weeks ago

@XuanYang-cn @congqixia

  1. release-load can fix it. However, reloading 9910041 entities costs almost 3 hours. Grafana metrics of the compact-opt-mem load: screenshots attached. A minimal sketch of the workaround follows below.
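
A minimal sketch of the release-then-load workaround mentioned above, assuming the collection handle `c` from the query snippets; the timeout value is an assumption (in this run the reload took almost 3 hours for ~9.9M entities):

    c.release()               # drop the collection from the query nodes
    c.load(timeout=4 * 3600)  # reload; almost 3 hours for ~9.9M entities in this run
    print(c.query("id >= 0", output_fields=["count(*)"], consistency_level="Strong"))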
ThreadDao commented 3 weeks ago

@XuanYang-cn @congqixia

  1. Strange things happened: the client did not issue any upsert or insert requests, yet before the second reload count(*) returned 9910041 and after the second reload it returned 10292193. By the way, the second reload cost 14.4098s (L1 mixCompaction done). See the cross-check sketch after this block.
    c.query('id >=0', output_fields=["count(*)"])
    data: ["{'count(*)': 9910041}"] ..., extra_info: {'cost': 0}
    9910041*128*4/1024/1024/1024  # raw vector payload: 9910041 entities * 128 dims * 4 bytes, in GiB
    4.725475788116455
    c.query('id >=0', output_fields=["count(*)"])
    data: ["{'count(*)': 9910041}"] ..., extra_info: {'cost': 0}
    c.release()
    186/24
    7.75
    c.query('id >=0', output_fields=["count(*)"])
    data: ["{'count(*)': 10292193}"] ..., extra_info: {'cost': 0}
    c.query('id >=0', output_fields=["count(*)"])
    data: ["{'count(*)': 10292193}"] ..., extra_info: {'cost': 0}
    c.query('id >=0', output_fields=["count(*)"], consistency_level="Strong")
    data: ["{'count(*)': 10292193}"] ..., extra_info: {'cost': 0}

    Time range reference Grafana links: metrics of compact-opt-mem second reload (screenshots attached)
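
As noted above, a simple cross-check for the count(*) jump (my addition, not part of the original run) is to compare the Strong-consistency count(*) against the collection's row-count statistics:

    from pymilvus import Collection, connections

    connections.connect(host="127.0.0.1", port="19530")  # endpoint is an assumption
    c = Collection("fouram_qb77Q7fh")
    res = c.query("id >= 0", output_fields=["count(*)"], consistency_level="Strong")
    print("count(*) with Strong consistency:", res)
    print("num_entities (collection statistics):", c.num_entities)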

XuanYang-cn commented 2 weeks ago

The count(*) part might be related to #33955.

XuanYang-cn commented 2 weeks ago

> @XuanYang-cn @congqixia 2. release-load can fix it. However, reloading 9910041 entities costs almost 3 hours

Seems like a known issue: when the target changes so quickly and loading is continuously moving forward, it will wait for a long time.