milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: It seems that the indexnode is not working and the number of unissued index tasks keeps increasing #36389

Closed: ThreadDao closed this 1 week ago

ThreadDao commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240918-a7dcc50f
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

test steps

  1. create a collection with 1 shard and 16 num_partitions
  2. insert 30m rows of 128-dim data
  3. flush -> index -> load
  4. concurrent requests: search + upsert + flush (screenshot)
  5. dataCoord shows that the collection has 14.8k L1 segments; the indexNode was not working and the number of unissued tasks kept increasing
    • segment count and index task count (screenshot)
    • indexNode CPU usage (screenshot)
    • mixcoord CPU usage is relatively high (screenshot). By the way, the proxy and mixcoord restarted twice due to a lost connection to etcd.

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/d026185e-699d-49eb-b4a5-8fc54af0ddae?nodeId=level-zero-stable-master-1726671600-2235253561

Milvus Log

level-master-pa-key-op-44-3730-etcd-0                             1/1     Running       0                36h     10.104.19.59    4am-node28   <none>           <none>
level-master-pa-key-op-44-3730-etcd-1                             1/1     Running       0                36h     10.104.30.2     4am-node38   <none>           <none>
level-master-pa-key-op-44-3730-etcd-2                             1/1     Running       0                36h     10.104.27.38    4am-node31   <none>           <none>
level-master-pa-key-op-44-3730-milvus-datanode-6c856c7db-6fgxq    1/1     Running       1 (36h ago)      36h     10.104.1.207    4am-node10   <none>           <none>
level-master-pa-key-op-44-3730-milvus-indexnode-7d8bbf4968cvbrn   1/1     Running       0                36h     10.104.20.245   4am-node22   <none>           <none>
level-master-pa-key-op-44-3730-milvus-indexnode-7d8bbf4968phzzw   1/1     Running       0                36h     10.104.33.149   4am-node36   <none>           <none>
level-master-pa-key-op-44-3730-milvus-mixcoord-5b44b997c8-84nxd   1/1     Running       2 (31h ago)      36h     10.104.13.102   4am-node16   <none>           <none>
level-master-pa-key-op-44-3730-milvus-proxy-5dbd8c8679-kzzp9      1/1     Running       2 (31h ago)      36h     10.104.1.206    4am-node10   <none>           <none>
level-master-pa-key-op-44-3730-milvus-querynode-0-78577d89d5d6h   1/1     Running       0                36h     10.104.17.15    4am-node23   <none>           <none>
level-master-pa-key-op-44-3730-milvus-querynode-0-78577d89pbv58   1/1     Running       5 (25h ago)      36h     10.104.18.229   4am-node25   <none>           <none>
level-master-pa-key-op-44-3730-milvus-querynode-0-78577d89r45sd   1/1     Running       1 (29h ago)      36h     10.104.20.247   4am-node22   <none>           <none>
level-master-pa-key-op-44-3730-milvus-querynode-0-78577d89s9vcz   1/1     Running       1 (25h ago)      36h     10.104.32.190   4am-node39   <none>           <none>
level-master-pa-key-op-44-3730-minio-0                            1/1     Running       0                36h     10.104.27.42    4am-node31   <none>           <none>
level-master-pa-key-op-44-3730-minio-1                            1/1     Running       0                36h     10.104.34.105   4am-node37   <none>           <none>
level-master-pa-key-op-44-3730-minio-2                            1/1     Running       0                36h     10.104.25.243   4am-node30   <none>           <none>
level-master-pa-key-op-44-3730-minio-3                            1/1     Running       0                36h     10.104.19.62    4am-node28   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-bookie-0                    1/1     Running       0                36h     10.104.34.106   4am-node37   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-bookie-1                    1/1     Running       0                36h     10.104.27.43    4am-node31   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-bookie-2                    1/1     Running       0                36h     10.104.15.93    4am-node20   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-broker-0                    1/1     Running       0                36h     10.104.1.197    4am-node10   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-proxy-0                     1/1     Running       0                36h     10.104.4.198    4am-node11   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-recovery-0                  1/1     Running       0                36h     10.104.25.241   4am-node30   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-zookeeper-0                 1/1     Running       0                36h     10.104.19.63    4am-node28   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-zookeeper-1                 1/1     Running       0                36h     10.104.34.113   4am-node37   <none>           <none>
level-master-pa-key-op-44-3730-pulsar-zookeeper-2                 1/1     Running       0                36h     10.104.16.68    4am-node21   <none>           <none>

Anything else?

No response

xiaocai2333 commented 2 months ago

Currently, a stats task runs before the index is built for each segment. Because of the large number of segments, the monitoring system shows the number of unissued index tasks slowly increasing; in fact, this is because the stats tasks are completing slowly. The stats tasks complete slowly because connecting to MinIO/S3 from the index node takes too long (more than 10 seconds), so datacoord concludes that the task assignment has failed. However, the index node has already cached the task. When datacoord reassigns the task, it does not first clear the cache on the index node, so the index node reports that the task already exists during reassignment. This blocks the task's progress, which explains why tasks execute quickly after the index node is restarted. Pull request #36371 will fix it.
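
To make the failure mode concrete, here is a minimal Go sketch of the assign/reassign handshake described above. All names (indexNode, AssignTask, DropTask) are hypothetical illustrations, not the actual Milvus datacoord/indexnode API; the point is only the ordering bug: a reassignment that does not clear the stale cached entry first.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// indexNode models an index node that caches assigned tasks.
// All names here are hypothetical, for illustration only.
type indexNode struct {
	mu    sync.Mutex
	tasks map[int64]bool // taskID -> cached
}

var errTaskExists = errors.New("task already exists")

// AssignTask caches the task; it rejects a task it has already cached.
func (n *indexNode) AssignTask(taskID int64) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.tasks[taskID] {
		return errTaskExists // this is what blocked reassignment in this issue
	}
	n.tasks[taskID] = true
	return nil
}

// DropTask clears a cached task; conceptually, the fix is to make sure
// this runs before (or as part of) any reassignment.
func (n *indexNode) DropTask(taskID int64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.tasks, taskID)
}

func main() {
	node := &indexNode{tasks: map[int64]bool{}}

	// First assignment: the reply is slow (slow MinIO/S3 connection), so the
	// coordinator times out, but the node has already cached the task.
	_ = node.AssignTask(42)

	// Buggy reassignment: the stale cache is not cleared first.
	if err := node.AssignTask(42); err != nil {
		fmt.Println("reassign without cleanup:", err)
	}

	// Fixed reassignment: drop the stale task, then assign again.
	node.DropTask(42)
	if err := node.AssignTask(42); err == nil {
		fmt.Println("reassign after cleanup: ok")
	}
}
```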

xiaofan-luan commented 2 months ago

  1. The 10s timeout seems too short.
  2. We need cancel logic to make sure the index task is canceled (see the sketch below).
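
A minimal sketch of the cancel logic point 2 asks for, assuming a hypothetical runStatsTask helper; Go's context package gives both the timeout knob (the 10s/30s value debated below) and cancellation when the coordinator gives up on an assignment:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runStatsTask is a hypothetical stand-in for dispatching a stats task to an
// index node. It respects ctx, so the coordinator can cancel or time it out.
func runStatsTask(ctx context.Context, taskID int64) error {
	select {
	case <-time.After(12 * time.Second): // simulate a slow MinIO/S3 connection
		return nil
	case <-ctx.Done():
		return ctx.Err() // canceled or deadline exceeded
	}
}

func main() {
	// The assignment timeout discussed in this thread: 10s trips on slow
	// object-storage connections; raising it to 30s is one proposed mitigation.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := runStatsTask(ctx, 42); err != nil {
		// On timeout, the coordinator should also cancel the node-side task
		// (and clear its cache) before reassigning, instead of leaving it cached.
		fmt.Println("assignment failed:", err)
	}
}
```
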
xiaocai2333 commented 1 month ago

PR #36371 will ensure tasks are cleaned up.

xiaofan-luan commented 1 month ago

Is 10s too short for the stats task? Maybe we could try increasing it to 30s.

xiaocai2333 commented 3 weeks ago

I think 10s is enough, as this is only the time for sending the request, not the time to wait for the stats task to complete.
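
In other words, the 10s bound covers only the assignment RPC; task completion is tracked separately. A hedged sketch of that separation (dispatchTask and pollTaskState are hypothetical names, not the actual Milvus API):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// dispatchTask models only the assignment request: "please run this task".
// It should return quickly; the 10s timeout applies here.
func dispatchTask(ctx context.Context, taskID int64) error {
	select {
	case <-time.After(100 * time.Millisecond): // normally fast
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// pollTaskState models the separate completion check: the coordinator polls
// the node for the task's state with no 10s bound, since building stats or
// an index can legitimately take minutes.
func pollTaskState(taskID int64) string {
	return "InProgress" // placeholder state
}

func main() {
	// Short deadline for the dispatch request only.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := dispatchTask(ctx, 42); err != nil {
		fmt.Println("dispatch failed:", err)
		return
	}
	// Completion is observed by polling, outside the dispatch deadline.
	fmt.Println("task state:", pollTaskState(42))
}
```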

xiaocai2333 commented 3 weeks ago

/assign @ThreadDao
Please verify it.
/unassign

ThreadDao commented 1 week ago

Can't reproduce.