milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Although the collections have been dropped, the datanode is still consuming the inserted data, resulting in OOMKilled #23154

Closed ThreadDao closed 1 year ago

ThreadDao commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: v2.2.4
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.0.dev34
- OS(Ubuntu or CentOS): 
- CPU/Memory:
- GPU: 
- Others:

Current Behavior

  1. Deploy Milvus with the following config and dataNode resources:
    components:
      dataCoord:
        paused: false
        replicas: 1
      dataNode:
        paused: false
        replicas: 1
        resources:
          limits:
            cpu: "2"
            memory: 12Gi
          requests:
            cpu: 100m
            memory: 128Mi
    config:
      dataNode:
        memory:
          forceSyncEnable: false
      log:
        level: debug
  2. Concurrently create partitions and insert data: create 100~500 partitions and insert data without flushing. Eventually, inserts were rejected because dataNode memory grew too high.
  3. After a while, drop all collections, but dataNode memory remains high.
  4. Delete the dataNode pod manually (I suspected the dataNode was holding memory because of too many unsynced segments).
  5. The dataNode restarted and has kept restarting due to OOMKilled.
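The workload in step 2 can be sketched as a concurrency driver. This is a minimal sketch, not the reporter's actual test code: the pymilvus `create_partition`/`insert` calls against a live cluster are replaced by a stub so the driver logic is self-contained, and the partition naming and row counts are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for pymilvus Collection.create_partition / Collection.insert;
# a real reproduction would call a live Milvus cluster instead.
inserted_rows = []

def create_partition_and_insert(i: int) -> int:
    partition = f"partition_{i}"  # hypothetical partition naming
    rows = [[random.random() for _ in range(128)] for _ in range(100)]
    inserted_rows.append((partition, len(rows)))
    return len(rows)  # note: flush() is deliberately never called

# Concurrently create partitions and insert into each without flushing,
# so unsynced segments accumulate in dataNode memory.
with ThreadPoolExecutor(max_workers=10) as pool:
    totals = list(pool.map(create_partition_and_insert, range(100)))

print(sum(totals))
```

The point of the pattern is that nothing ever flushes: every insert stays buffered in the dataNode as an unsynced segment until memory pressure blocks further inserts.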

memory: (screenshot attachment)

datanode pprof: (pprof heap profile attachment)

Expected Behavior

The dataNode should not keep holding memory for collections that have already been dropped.

Steps To Reproduce

No response

Milvus Log

pods in devops cluster chaos-testing namespace:

disable-sync-etcd-0                                               1/1     Running             7 (4d ago)       4d      10.102.6.63     devops-node10   <none>           <none>
disable-sync-etcd-1                                               1/1     Running             1 (4d ago)       4d      10.102.10.136   devops-node20   <none>           <none>
disable-sync-etcd-2                                               1/1     Running             7 (4d ago)       4d      10.102.9.67     devops-node13   <none>           <none>
disable-sync-milvus-datacoord-7b5fd6d467-6br5p                    1/1     Running             1 (10h ago)      4d      10.102.10.140   devops-node20   <none>           <none>
disable-sync-milvus-datanode-5c75d86ff6-qr649                     1/1     Running             98 (29s ago)     15h     10.102.6.96     devops-node10   <none>           <none>
disable-sync-milvus-indexcoord-585548785c-q9wcm                   1/1     Running             0                4d      10.102.10.142   devops-node20   <none>           <none>
disable-sync-milvus-indexnode-869c74b554-cs6qd                    1/1     Running             1                4d      10.102.10.145   devops-node20   <none>           <none>
disable-sync-milvus-proxy-6bd895b7cd-p4cb6                        1/1     Running             1 (40h ago)      4d      10.102.10.148   devops-node20   <none>           <none>
disable-sync-milvus-querycoord-88c4bf645-g8psj                    1/1     Running             0                4d      10.102.10.147   devops-node20   <none>           <none>
disable-sync-milvus-querynode-7848d7f59f-pl27j                    1/1     Running             0                4d      10.102.10.144   devops-node20   <none>           <none>
disable-sync-milvus-rootcoord-6d7d4f44bc-24dbv                    1/1     Running             0                4d      10.102.10.146   devops-node20   <none>           <none>
disable-sync-minio-0                                              1/1     Running             0                4d      10.102.9.62     devops-node13   <none>           <none>
disable-sync-minio-1                                              1/1     Running             0                4d      10.102.6.64     devops-node10   <none>           <none>
disable-sync-minio-2                                              1/1     Running             0                4d      10.102.7.13     devops-node11   <none>           <none>
disable-sync-minio-3                                              1/1     Running             0                4d      10.102.10.122   devops-node20   <none>           <none>
disable-sync-pulsar-bookie-0                                      1/1     Running             0                4d      10.102.9.71     devops-node13   <none>           <none>
disable-sync-pulsar-bookie-1                                      1/1     Running             0                4d      10.102.7.247    devops-node11   <none>           <none>
disable-sync-pulsar-bookie-2                                      1/1     Running             0                4d      10.102.6.85     devops-node10   <none>           <none>
disable-sync-pulsar-bookie-init-l5jsp                             0/1     Completed           0                4d      10.102.9.60     devops-node13   <none>           <none>
disable-sync-pulsar-broker-0                                      1/1     Running             0                4d      10.102.7.62     devops-node11   <none>           <none>
disable-sync-pulsar-proxy-0                                       1/1     Running             0                4d      10.102.6.61     devops-node10   <none>           <none>
disable-sync-pulsar-pulsar-init-skc45                             0/1     Completed           0                4d      10.102.6.50     devops-node10   <none>           <none>
disable-sync-pulsar-recovery-0                                    1/1     Running             0                4d      10.102.6.60     devops-node10   <none>           <none>
disable-sync-pulsar-zookeeper-0                                   1/1     Running             0                4d      10.102.6.71     devops-node10   <none>           <none>
disable-sync-pulsar-zookeeper-1                                   1/1     Running             0                4d      10.102.9.75     devops-node13   <none>           <none>
disable-sync-pulsar-zookeeper-2                                   1/1     Running             0                4d      10.102.7.133    devops-node11   <none>           <none>

Anything else?

No response

ThreadDao commented 1 year ago

/assign @XuanYang-cn

XuanYang-cn commented 1 year ago

/assign @ThreadDao /unassign

Please set forceSyncEnable==true and retry
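The suggested retry corresponds to flipping the flag in the config from step 1 (same layout as the original report):

```yaml
config:
  dataNode:
    memory:
      forceSyncEnable: true   # was false in the failing deployment
```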

yanliang567 commented 1 year ago

@XuanYang-cn shouldn't we ignore the messages in the MQ, since the collections were dropped already?
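The behavior suggested here could be sketched as a consumer-side filter. This is a hypothetical illustration, not Milvus's actual dataNode code: the message shape and the hard-coded dropped-collection set are assumptions (in Milvus the dropped state would come from coordinator metadata).

```python
from dataclasses import dataclass

@dataclass
class InsertMsg:
    collection_id: int  # hypothetical MQ message shape
    payload: bytes

# Collections known to be dropped (in a real system this would be
# looked up from dataCoord/rootCoord metadata, not hard-coded).
dropped_collections = {101, 102}

def consume(messages):
    """Skip insert messages for dropped collections instead of
    buffering their data in memory."""
    kept = []
    for msg in messages:
        if msg.collection_id in dropped_collections:
            continue  # discard: this data can never be flushed usefully
        kept.append(msg)
    return kept

msgs = [InsertMsg(101, b"a"), InsertMsg(200, b"b"), InsertMsg(102, b"c")]
print([m.collection_id for m in consume(msgs)])  # only collection 200 survives
```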

yanliang567 commented 1 year ago

/assign @XuanYang-cn /unassign

xiaofan-luan commented 1 year ago

@XuanYang-cn shouldn't we ignore the messages in the MQ, since the collections were dropped already?

I think something is wrong here. The data cleanup path is broken, and we need to fix it ASAP.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
