milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io

[Bug]: When clustering compaction is enabled and concurrent searches and upserts are performed, dn oomkilled and searches fail #36686

Open ThreadDao opened 3 days ago

ThreadDao commented 3 days ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240930-2055df81-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

deploy a Milvus cluster with the following config:

```yaml
config:
  dataCoord:
    compaction:
      clustering:
        autoEnable: true
    enableActiveStandby: true
  indexCoord:
    enableActiveStandby: true
  log:
    level: debug
  queryCoord:
    enableActiveStandby: true
  queryNode:
    levelZeroForwardPolicy: RemoteLoad
  rootCoord:
    enableActiveStandby: true
  trace:
    exporter: jaeger
    jaeger:
      url: http://tempo-distributor.tempo:14268/api/traces
    sampleFraction: 1
```
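For context, a config fragment of this shape is typically supplied through the Milvus Operator CR's `spec.config` (or an equivalent Helm values override). A minimal sketch, assuming an Operator-managed cluster (which the pod naming below suggests, but the report does not confirm):

```yaml
# Sketch only: apiVersion/kind are the Milvus Operator's CRD; whether this
# cluster was actually deployed via the Operator is an assumption.
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: stats-varchar-cring-op-57-4369
spec:
  mode: cluster
  config:
    dataCoord:
      compaction:
        clustering:
          autoEnable: true   # enables automatic clustering compaction
```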


### test steps
1. create a collection `fouram_j7MgI6d5` with 3 fields: id (varchar pk, max_length=64) + float_vector (dim=128) + int64_ck (clustering key):

```python
{'auto_id': False, 'description': '', 'fields': [
     {'name': 'id', 'description': '', 'type': <DataType.VARCHAR: 21>,
      'params': {'max_length': 64}, 'is_primary': True, 'auto_id': False},
     {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>,
      'params': {'dim': 128}},
     {'name': 'int64_ck', 'description': '', 'type': <DataType.INT64: 5>,
      'is_clustering_key': True}],
 'enable_dynamic_field': False}
```

2. create an index on the vector field
3. insert 30M entities; every pk value is 64 characters long
4. flush -> index again -> load
5. run concurrent requests: query + search + upsert (a pymilvus sketch of these steps follows the screenshot below)
![image](https://github.com/user-attachments/assets/6155038b-5697-4328-87cf-44b4da6f477e)
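For reference, the steps above roughly map onto the following pymilvus sketch. This is a minimal reproduction outline, not the actual test harness: the endpoint, index parameters, batch sizes, and worker counts are assumptions, and the real run inserts 30M rows.

```python
# Minimal sketch of steps 1-5 (assumes pymilvus >= 2.4 for clustering-key
# support; endpoint, index params, and sizes are illustrative only).
import random
import string
from concurrent.futures import ThreadPoolExecutor

from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Step 1: varchar pk (max_length=64) + 128-dim vector + int64 clustering key.
schema = CollectionSchema(
    [
        FieldSchema("id", DataType.VARCHAR, max_length=64, is_primary=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
        FieldSchema("int64_ck", DataType.INT64, is_clustering_key=True),
    ],
    enable_dynamic_field=False,
)
collection = Collection("fouram_j7MgI6d5", schema)

def rows(n):
    # Step 3: every pk value is exactly 64 characters long.
    return [{
        "id": "".join(random.choices(string.ascii_letters, k=64)),
        "float_vector": [random.random() for _ in range(128)],
        "int64_ck": random.randint(0, 10_000),
    } for _ in range(n)]

# Steps 2-4: index, insert (30M entities in the real run), flush, load.
collection.create_index("float_vector", {
    "index_type": "HNSW", "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},  # assumed index params
})
for _ in range(10):  # scaled down; the report inserts 30M entities
    collection.insert(rows(1_000))
collection.flush()
collection.load()

# Step 5: concurrent query + search + upsert.
def do_search():
    collection.search([[random.random()] * 128], "float_vector",
                      {"metric_type": "L2", "params": {"ef": 64}}, limit=10)

def do_query():
    collection.query(expr="int64_ck >= 0", limit=10)

def do_upsert():
    collection.upsert(rows(100))

with ThreadPoolExecutor(max_workers=12) as pool:
    for _ in range(300):
        pool.submit(random.choice([do_search, do_query, do_upsert]))
```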

### results
1. about 40% of search and 40% of query requests failed with `DEADLINE_EXCEEDED`:

```
[2024-09-30 10:42:54,563 - ERROR - fouram]: (api_response) : [Collection.search] <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-09-30T10:42:54.554845202+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
>, [requestId: 7844132a-7f18-11ef-a6ba-5243da83fde5] (api_request.py:57)

[2024-09-30 10:44:55,398 - ERROR - fouram]: (api_response) : [Collection.query] <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-09-30T10:42:55.295962416+00:00"}"
>, [requestId: 78b65854-7f18-11ef-a6ba-5243da83fde5] (api_request.py:57)
```

2. the dataNode keeps getting OOMKilled, about 60 restarts so far:

```
stats-varchar-cring-op-57-4369-milvus-datanode-74c684d448-wkgdt 1/1 Running 60 (140m ago) 7d21h 10.104.14.194 4am-node18
```


3. clustering compaction seems to be executed repeatedly
- at first, I saw all flushed segments at level L2:
![image](https://github.com/user-attachments/assets/89c9946b-108f-443e-af93-86ec9f9f588a)
- after a while, all flushed segments' level turned to L1(?):
![image](https://github.com/user-attachments/assets/0a6e336a-8199-4ddd-9303-c35c29ce2beb)


4. wrong metrics: flushed-L1 keeps increasing and flushed-L2 is always negative; neither matches birdwatcher's output:

```
--- Growing: 0, Sealed: 0, Flushed: 22, Dropped: 110
--- Small Segments: 0, row count: 0  Other Segments: 22, row count: 31185000
--- Total Segments: 22, row count: 31185000
```

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/6797e23d-7b9f-45c5-9bdd-84ed8c2b67b9?nodeId=zong-stats-clustering-24h-1

Milvus Log

pods:

```
stats-varchar-cring-op-57-4369-etcd-0                             1/1     Running     0                7d21h   10.104.33.40    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-etcd-1                             1/1     Running     0                7d21h   10.104.20.29    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-etcd-2                             1/1     Running     0                7d21h   10.104.34.242   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-datanode-74c684d448-wkgdt   1/1     Running     60 (140m ago)    7d21h   10.104.14.194   4am-node18   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-indexnode-8588485b47q62wp   1/1     Running     0                7d21h   10.104.33.55    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-indexnode-8588485b47sp9zq   1/1     Running     0                7d21h   10.104.13.212   4am-node16   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-mixcoord-5bfc78b85-28zk8    1/1     Running     0                7d21h   10.104.5.135    4am-node12   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-proxy-5c598886-g8jj8        1/1     Running     0                7d21h   10.104.20.45    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-querynode-0-6ccbc88ffdzmx   1/1     Running     0                7d21h   10.104.18.27    4am-node25   <none>           <none>
stats-varchar-cring-op-57-4369-milvus-querynode-0-6ccbc88fwj55m   1/1     Running     0                7d21h   10.104.34.250   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-minio-0                            1/1     Running     0                7d21h   10.104.33.42    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-minio-1                            1/1     Running     0                7d21h   10.104.34.243   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-minio-2                            1/1     Running     0                7d21h   10.104.18.22    4am-node25   <none>           <none>
stats-varchar-cring-op-57-4369-minio-3                            1/1     Running     0                7d21h   10.104.20.30    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-0                    1/1     Running     0                7d21h   10.104.34.244   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-1                    1/1     Running     0                7d21h   10.104.18.24    4am-node25   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-2                    1/1     Running     0                7d21h   10.104.33.46    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-bookie-init-q7xq7           0/1     Completed   0                7d21h   10.104.13.195   4am-node16   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-broker-0                    1/1     Running     0                7d21h   10.104.1.99     4am-node10   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-proxy-0                     1/1     Running     0                7d21h   10.104.20.26    4am-node22   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-pulsar-init-pzwsw           0/1     Completed   0                7d21h   10.104.1.98     4am-node10   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-recovery-0                  1/1     Running     0                7d21h   10.104.13.196   4am-node16   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-zookeeper-0                 1/1     Running     0                7d21h   10.104.33.43    4am-node36   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-zookeeper-1                 1/1     Running     0                7d21h   10.104.34.246   4am-node37   <none>           <none>
stats-varchar-cring-op-57-4369-pulsar-zookeeper-2                 1/1     Running     0                7d21h   10.104.20.33    4am-node22   <none>           <none>
```

Anything else?

No response

xiaofan-luan commented 3 days ago

/assign @wayblink

Please help with this.

xiaocai2333 commented 3 days ago

Two issues were discovered:

  1. After the stats task, the segment's level was set to L1, which caused the segments resulting from clustering compaction to also be L1. This led to continuous triggering of clustering compaction (see the sketch below).
  2. It appears that there is a memory leak in clustering compaction. In the image below, the two boxes on the left and right indicate the times when the datanode was restarted. Two clustering compactions were performed in between, and the resident memory of the datanode increased.

(image: datanode resident memory over time, with the two restart points boxed)
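Issue 1 describes a self-sustaining loop: segments produced by clustering compaction are supposed to stay at L2 so that they are excluded from the next round of candidate selection, but the stats task resets them to L1. A hypothetical Python sketch of that loop (the names and the selection rule are illustrative; the actual logic is Go code in datacoord):

```python
# Hypothetical illustration of the feedback loop from issue 1; names and
# selection rule are illustrative, not Milvus's actual implementation.
L1, L2 = 1, 2

def clustering_candidates(segments):
    # Conceptually, clustering compaction targets flushed segments that
    # have not been clustered yet, i.e. segments still at level L1.
    return [s for s in segments if s["level"] == L1]

segments = [{"id": i, "level": L1} for i in range(4)]
for round_no in range(3):  # bounded here; unbounded in the reported cluster
    picked = clustering_candidates(segments)
    if not picked:
        break  # correct behavior: L2 output would end the cycle
    # Compaction output should be marked L2 ...
    segments = [{"id": 100 * round_no + i, "level": L2}
                for i in range(len(picked))]
    # ... but the stats task rewrites the level back to L1, so the same
    # data is picked up and re-clustered on every round.
    for seg in segments:
        seg["level"] = L1

print(f"candidates left after 3 rounds: {len(clustering_candidates(segments))}")
```

With the stats-task reset in place, the candidate list never drains, which matches the repeated compactions and the L2-to-L1 flapping seen in the screenshots above.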

xiaofan-luan commented 2 days ago

@xiaocai2333 @czs007 I think there is a conflict here: when a stats task is triggered, the segment IDs change, so all the partition info becomes outdated (because it records segment ID info).

xiaocai2333 commented 2 days ago

Based on the current situation, the level here should not be L2 but L1. It turns out an error occurred while resetting the segment level after the clustering compaction failed (see the attached screenshot).