milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: When clustering compaction is enabled and concurrent searches and upserts are performed, dataNode is OOMKilled and searches fail #36686

Closed by ThreadDao 4 weeks ago

ThreadDao commented 1 month ago

Is there an existing issue for this?

- [x] I have searched the existing issues

Environment

- Milvus version: master-20240930-2055df81-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Deploy a Milvus cluster with the config below:

```yaml
config:
  dataCoord:
    compaction:
      clustering:
        autoEnable: true
    enableActiveStandby: true
  indexCoord:
    enableActiveStandby: true
  log:
    level: debug
  queryCoord:
    enableActiveStandby: true
  queryNode:
    levelZeroForwardPolicy: RemoteLoad
  rootCoord:
    enableActiveStandby: true
  trace:
    exporter: jaeger
    jaeger:
      url: http://tempo-distributor.tempo:14268/api/traces
    sampleFraction: 1
```


### test steps
1. create a collection `fouram_j7MgI6d5` with 3 fields: `id` (varchar pk, max_length=64) + `float_vector` (128 dim) + `int64_ck` (clustering key):

```python
{'auto_id': False, 'description': '', 'fields': [
    {'name': 'id', 'description': '', 'type': <DataType.VARCHAR: 21>,
     'params': {'max_length': 64}, 'is_primary': True, 'auto_id': False},
    {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>,
     'params': {'dim': 128}},
    {'name': 'int64_ck', 'description': '', 'type': <DataType.INT64: 5>,
     'is_clustering_key': True}],
 'enable_dynamic_field': False}
```

2.  create index for vectors
3. insert 30M entities; every pk value is a 64-character string
4. flush -> index again -> load
5. concurrent requests: query + search + upsert (a pymilvus sketch of steps 1-5 follows the screenshot below)
![image](https://github.com/user-attachments/assets/6155038b-5697-4328-87cf-44b4da6f477e)
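A minimal pymilvus sketch of steps 1-5, assuming a local connection. The collection name and schema come from this report; the index parameters, batch size, and concurrency mix are illustrative guesses (the actual harness is the fouram tool seen in the logs):

```python
import random
import string
from concurrent.futures import ThreadPoolExecutor

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")

# step 1: varchar pk (max_length=64) + 128-dim vector + int64 clustering key
schema = CollectionSchema(
    fields=[
        FieldSchema("id", DataType.VARCHAR, max_length=64, is_primary=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
        FieldSchema("int64_ck", DataType.INT64, is_clustering_key=True),
    ],
    enable_dynamic_field=False,
)
coll = Collection("fouram_j7MgI6d5", schema)

# step 2: create index for the vector field (index type/params are assumptions)
index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
coll.create_index("float_vector", index_params)

def pk() -> str:
    # every pk value is exactly 64 characters long, as in the report
    return "".join(random.choices(string.ascii_lowercase, k=64))

def rows(n: int) -> list:
    return [
        [pk() for _ in range(n)],
        [[random.random() for _ in range(128)] for _ in range(n)],
        [random.randrange(1 << 31) for _ in range(n)],
    ]

# step 3: insert 30M entities in batches (long-running at full scale)
for _ in range(30_000_000 // 50_000):
    coll.insert(rows(50_000))

# step 4: flush -> index again -> load
coll.flush()
coll.create_index("float_vector", index_params)
coll.load()

# step 5: concurrent query + search + upsert
def do_search():
    coll.search([[random.random() for _ in range(128)]], "float_vector",
                {"metric_type": "L2", "params": {"ef": 64}}, limit=10)

def do_query():
    coll.query(expr="int64_ck >= 0", limit=10)

def do_upsert():
    coll.upsert(rows(100))

with ThreadPoolExecutor(max_workers=12) as pool:
    for _ in range(1000):
        pool.submit(random.choice([do_search, do_query, do_upsert]))
```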

### results
1. 40% of search and 40% of query requests failed with DEADLINE_EXCEEDED (see the note after this list):

```
[2024-09-30 10:42:54,563 - ERROR - fouram]: (api_response) : [Collection.search] <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-09-30T10:42:54.554845202+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
>, [requestId: 7844132a-7f18-11ef-a6ba-5243da83fde5] (api_request.py:57)

[2024-09-30 10:44:55,398 - ERROR - fouram]: (api_response) : [Collection.query] <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-09-30T10:42:55.295962416+00:00"}"
>, [requestId: 78b65854-7f18-11ef-a6ba-5243da83fde5] (api_request.py:57)
```

2. The dataNode keeps getting OOMKilled, about 60 restarts:

```
stats-varchar-cring-op-57-4369-milvus-datanode-74c684d448-wkgdt 1/1 Running 60 (140m ago) 7d21h 10.104.14.194 4am-node18
```


3. Clustering compaction seems to be executed repeatedly
- I saw that all flushed segments' level was L2
![image](https://github.com/user-attachments/assets/89c9946b-108f-443e-af93-86ec9f9f588a)
- after a while, all flushed segments' level turned to L1?
![image](https://github.com/user-attachments/assets/0a6e336a-8199-4ddd-9303-c35c29ce2beb)


4. Wrong metrics: flushed-L1 keeps increasing and flushed-L2 is always negative; neither matches what birdwatcher shows:

```
--- Growing: 0, Sealed: 0, Flushed: 22, Dropped: 110
--- Small Segments: 0, row count: 0  Other Segments: 22, row count: 31185000
--- Total Segments: 22, row count: 31185000
```
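A note on result 1: DEADLINE_EXCEEDED means the client-side gRPC deadline expired before the server answered. In pymilvus that deadline is the per-call `timeout` argument, in seconds. Continuing the sketch from the test steps (the 60s value is an assumption; the harness's actual setting is not in the logs):

```python
# `coll` is the loaded collection from the sketch above.
res = coll.search(
    data=[[0.0] * 128],
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=10,
    timeout=60,  # seconds (assumed); once exceeded, DEADLINE_EXCEEDED is raised
)
```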

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/6797e23d-7b9f-45c5-9bdd-84ed8c2b67b9?nodeId=zong-stats-clustering-24h-1

Milvus Log

pods:

```
stats-varchar-cring-op-57-4369-etcd-0                             1/1     Running     0                7d21h   10.104.33.40    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-etcd-1                             1/1     Running     0                7d21h   10.104.20.29    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-etcd-2                             1/1     Running     0                7d21h   10.104.34.242   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-datanode-74c684d448-wkgdt   1/1     Running     60 (140m ago)    7d21h   10.104.14.194   4am-node18   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-indexnode-8588485b47q62wp   1/1     Running     0                7d21h   10.104.33.55    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-indexnode-8588485b47sp9zq   1/1     Running     0                7d21h   10.104.13.212   4am-node16   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-mixcoord-5bfc78b85-28zk8    1/1     Running     0                7d21h   10.104.5.135    4am-node12   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-proxy-5c598886-g8jj8        1/1     Running     0                7d21h   10.104.20.45    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-querynode-0-6ccbc88ffdzmx   1/1     Running     0                7d21h   10.104.18.27    4am-node25   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-querynode-0-6ccbc88fwj55m   1/1     Running     0                7d21h   10.104.34.250   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-minio-0                            1/1     Running     0                7d21h   10.104.33.42    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-minio-1                            1/1     Running     0                7d21h   10.104.34.243   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-minio-2                            1/1     Running     0                7d21h   10.104.18.22    4am-node25   <none>           <none>
stats-varchar-cring-op-57-4369-minio-3                            1/1     Running     0                7d21h   10.104.20.30    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-0                    1/1     Running     0                7d21h   10.104.34.244   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-1                    1/1     Running     0                7d21h   10.104.18.24    4am-node25   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-2                    1/1     Running     0                7d21h   10.104.33.46    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-init-q7xq7           0/1     Completed   0                7d21h   10.104.13.195   4am-node16   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-broker-0                    1/1     Running     0                7d21h   10.104.1.99     4am-node10   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-proxy-0                     1/1     Running     0                7d21h   10.104.20.26    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-pulsar-init-pzwsw           0/1     Completed   0                7d21h   10.104.1.98     4am-node10   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-recovery-0                  1/1     Running     0                7d21h   10.104.13.196   4am-node16   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-zookeeper-0                 1/1     Running     0                7d21h   10.104.33.43    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-zookeeper-1                 1/1     Running     0                7d21h   10.104.34.246   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-zookeeper-2                 1/1     Running     0                7d21h   10.104.20.33    4am-node22   <none>           <none>
```

Anything else?

No response

xiaofan-luan commented 1 month ago

/assign @wayblink

please help on this

xiaocai2333 commented 1 month ago

Two issues were discovered:

  1. After the stats task, the segment's level was set to L1, which caused the segments produced by clustering compaction to also be L1. This led to clustering compaction being triggered continuously (see the schematic after this comment).
  2. There appears to be a memory leak in clustering compaction. In the image below, the left and right boxes mark the times when the datanode was restarted; two clustering compactions ran in between, and the datanode's resident memory increased.

(image: datanode resident memory chart between the two restarts)
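A plain-Python schematic of issue 1 (the real logic lives in the Go datacoord/datanode; all names here are illustrative): clustering compaction should emit L2 segments, but the stats task resets them to L1, so the compaction trigger keeps seeing "unclustered" L1 segments and fires again.

```python
L1, L2 = "L1", "L2"

def stats_task(segment: dict) -> None:
    # BUG (illustrative): downgrades the freshly clustered segment to L1
    segment["level"] = L1

def clustering_compaction(segments: list[dict]) -> list[dict]:
    # compaction emits a new L2 segment from its inputs...
    out = [{"id": max(s["id"] for s in segments) + 1, "level": L2}]
    for seg in out:
        stats_task(seg)  # ...but the stats task then resets its level
    return out

segments = [{"id": 1, "level": L1}]
for round_no in range(3):
    todo = [s for s in segments if s["level"] == L1]
    if not todo:
        break  # never reached: there is always an L1 segment left
    segments = clustering_compaction(todo)
    print(round_no, segments)  # level stays L1, so compaction retriggers
```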

xiaofan-luan commented 1 month ago

@xiaocai2333 @czs007 I think there is a conflict right now: when the stats task is triggered, the segment id changes, so all the partition info becomes outdated (because it records segment id info).
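A schematic of that conflict, again in illustrative Python rather than the actual Go code: the clustering (partition) info records segment ids, and the stats task rewrites segments under new ids, so the recorded mapping goes stale.

```python
# cluster-key range -> segment ids, as recorded when compaction finished
partition_info = {"int64_ck in [0, 1000)": [101, 102]}

def stats_task(old_id: int) -> int:
    # the stats task rewrites a segment, assigning it a new id
    return old_id + 900

live_segments = {stats_task(sid) for sid in (101, 102)}  # {1001, 1002}

stale = [sid for sids in partition_info.values() for sid in sids
         if sid not in live_segments]
print(stale)  # [101, 102]: every recorded id now points at a dropped segment
```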

xiaocai2333 commented 1 month ago

Based on the current situation, the level here should not be L2 but L1. It was discovered that an error occurred when resetting the segment level after the clustering compaction failed.

xiaocai2333 commented 1 month ago

The root cause is that compaction tasks leaked, and since each compaction task holds a mapping from cluster key to buffer, this caused a large memory leak. I will fix it right away.
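A plain-Python schematic of the leak pattern described above (the datanode is Go; class and field names are illustrative): if finished tasks are never dropped from the executor's registry, each task keeps its cluster-key-to-buffer map reachable, and resident memory grows with every compaction.

```python
class ClusteringCompactionTask:
    def __init__(self, task_id: int, num_keys: int = 64):
        self.task_id = task_id
        # stand-in for the per-cluster-key write buffers (large in practice)
        self.cluster_buffers = {k: bytearray(1 << 20) for k in range(num_keys)}

class CompactionExecutor:
    def __init__(self) -> None:
        self.tasks: dict[int, ClusteringCompactionTask] = {}

    def execute(self, task: ClusteringCompactionTask) -> None:
        self.tasks[task.task_id] = task
        # ... perform the compaction ...
        # BUG (illustrative): the finished task is never removed here, so
        # its cluster_buffers stay reachable.
        # FIX: self.tasks.pop(task.task_id, None) once the task completes.

executor = CompactionExecutor()
for i in range(10):
    executor.execute(ClusteringCompactionTask(i))
print(len(executor.tasks))  # 10 leaked tasks, each pinning ~64 MiB of buffers
```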

ThreadDao commented 4 weeks ago

Fixed on master-20241023-1d61b604-amd64: no dataNode OOM and no clustering compaction loop.