milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Indexnode memory increased even though all the data in the collection are expired with TTL and no index task issued #26009

Closed yanliang567 closed 1 year ago

yanliang567 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:  master-20230725-a1321223
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):

Current Behavior

Index node memory increased even though all the data had expired with TTL and no index task was issued. image

Expected Behavior

Index node memory should be released, or at least it should not increase again, since there is no data and no tasks have been issued.

Steps To Reproduce

1. create a collection with TTL 24h
2. insert 4m-256d data (4 million 256-dimensional vectors)
3. build index (index building and compaction will run in a loop, which causes MinIO disk usage to keep increasing; see #25955)
4. wait for 48h and check
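
To put the steps above into numbers, here is a small sketch. It assumes float32 vectors (the issue does not state the data type) and uses the Milvus collection property key `collection.ttl.seconds` for the 24h TTL. The helper names are illustrative only and are not part of any Milvus SDK.

```python
from datetime import timedelta

# Assumption: vectors are float32 (4 bytes per dimension); the issue does not say.
BYTES_PER_FLOAT32 = 4


def ttl_properties(ttl: timedelta) -> dict:
    """Build collection properties for a TTL, keyed by `collection.ttl.seconds`."""
    return {"collection.ttl.seconds": int(ttl.total_seconds())}


def raw_vector_bytes(num_rows: int, dim: int) -> int:
    """Raw size of the vector column only (no index, no other fields)."""
    return num_rows * dim * BYTES_PER_FLOAT32


def is_expired(inserted_at_s: float, now_s: float, ttl_s: int) -> bool:
    """A row is past its TTL once its age exceeds ttl_s seconds."""
    return now_s - inserted_at_s > ttl_s


props = ttl_properties(timedelta(hours=24))   # step 1: TTL 24h
size = raw_vector_bytes(4_000_000, 256)       # step 2: 4m-256d
print(props)
print(f"{size / 2**30:.2f} GiB of raw vectors")
# Step 4: after 48h, data inserted at t=0 is well past the 24h TTL.
print(is_expired(0, 48 * 3600, props["collection.ttl.seconds"]))
```

Under these assumptions the raw vectors come to a little under 4 GiB, and everything inserted in step 2 is expired by the time of the 48h check.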

Milvus Log

pod names on 4am cluster

yanliang-disk22-etcd-0                                            1/1     Running            0                 7h31m   10.104.13.117   4am-node16   <none>           <none>
yanliang-disk22-etcd-1                                            1/1     Running            0                 7h31m   10.104.12.121   4am-node17   <none>           <none>
yanliang-disk22-etcd-2                                            1/1     Running            0                 7h31m   10.104.6.231    4am-node13   <none>           <none>
yanliang-disk22-milvus-datanode-d94bcfbcf-7bfsp                   1/1     Running            0                 7h27m   10.104.9.206    4am-node14   <none>           <none>
yanliang-disk22-milvus-indexnode-64c5b78c75-w9787                 1/1     Running            0                 7h27m   10.104.9.213    4am-node14   <none>           <none>
yanliang-disk22-milvus-mixcoord-5bbcd54bd7-4fswm                  1/1     Running            0                 7h27m   10.104.9.211    4am-node14   <none>           <none>
yanliang-disk22-milvus-proxy-6bb58fb87c-l6qcl                     1/1     Running            0                 7h27m   10.104.9.208    4am-node14   <none>           <none>
yanliang-disk22-milvus-querynode-5bd446d687-bzgnw                 1/1     Running            0                 7h27m   10.104.9.209    4am-node14   <none>           <none>
yanliang-disk22-minio-0                                           1/1     Running            0                 7h31m   10.104.13.118   4am-node16   <none>           <none>
yanliang-disk22-minio-1                                           1/1     Running            0                 7h31m   10.104.6.234    4am-node13   <none>           <none>
yanliang-disk22-minio-2                                           1/1     Running            0                 7h31m   10.104.12.123   4am-node17   <none>           <none>
yanliang-disk22-minio-3                                           1/1     Running            0                 7h31m   10.104.5.118    4am-node12   <none>           <none>
yanliang-disk22-pulsar-bookie-0                                   1/1     Running            0                 7h31m   10.104.13.121   4am-node16   <none>           <none>
yanliang-disk22-pulsar-bookie-1                                   1/1     Running            0                 7h31m   10.104.23.11    4am-node27   <none>           <none>
yanliang-disk22-pulsar-bookie-2                                   1/1     Running            0                 7h31m   10.104.4.176    4am-node11   <none>           <none>
yanliang-disk22-pulsar-bookie-init-qlvgj                          0/1     Completed          0                 7h31m   10.104.13.113   4am-node16   <none>           <none>
yanliang-disk22-pulsar-broker-0                                   1/1     Running            0                 7h31m   10.104.6.229    4am-node13   <none>           <none>
yanliang-disk22-pulsar-proxy-0                                    1/1     Running            0                 7h31m   10.104.12.119   4am-node17   <none>           <none>
yanliang-disk22-pulsar-pulsar-init-drvhf                          0/1     Completed          0                 7h31m   10.104.13.114   4am-node16   <none>           <none>
yanliang-disk22-pulsar-recovery-0                                 1/1     Running            0                 7h31m   10.104.13.115   4am-node16   <none>           <none>
yanliang-disk22-pulsar-zookeeper-0                                1/1     Running            0                 7h31m   10.104.12.127   4am-node17   <none>           <none>
yanliang-disk22-pulsar-zookeeper-1                                1/1     Running            0                 7h31m   10.104.5.122    4am-node12   <none>           <none>
yanliang-disk22-pulsar-zookeeper-2                                1/1     Running            0                 7h30m   10.104.14.241   4am-node18   <none>           <none>

Anything else?

image

yanliang567 commented 1 year ago

/assign @jiaoew1991 /unassign

jiaoew1991 commented 1 year ago

/assign @XuanYang-cn /unassign

Many users have recently encountered the issue of setting TTL but not seeing a decrease in storage.

xiaofan-luan commented 1 year ago

> /assign @XuanYang-cn /unassign
>
> Many users have recently encountered the issue of setting TTL but not seeing a decrease in storage.

I think this is related to the compaction policy, where we are trying to purge expired data too frequently. This should be fixed with the new GC policy, but we could also do something on the compaction side.

XuanYang-cn commented 1 year ago

Found a permanent memory leak in DN (DataNode). It is an error in lock usage: DC (DataCoord) gets the results before writing the task, so it sees no plan in DN but an executing task in DC, concludes the task has failed, and never calls SyncSegments, which causes the DN memory leak. image image

#26032
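
The ordering problem described in this comment can be illustrated with a toy model. This is only a sketch of one reading of the comment, not Milvus code: the class and method names (`DataNode`, `DataCoord`, `reconcile`, and so on) are hypothetical, and the real components use locks and RPCs rather than direct calls.

```python
class DataNode:
    """Toy stand-in for DN: holds each compaction result in memory
    until the coordinator syncs it."""

    def __init__(self):
        self.results = {}  # plan_id -> in-memory compaction result

    def run_plan(self, plan_id):
        self.results[plan_id] = f"result-of-{plan_id}"

    def get_results(self):
        return set(self.results)  # snapshot of plan IDs known to DN

    def sync_segments(self, plan_id):
        self.results.pop(plan_id, None)  # releases the held result


class DataCoord:
    """Toy stand-in for DC: tracks executing tasks and reconciles them
    against a snapshot of DN results."""

    def __init__(self, dn):
        self.dn = dn
        self.executing = set()
        self.failed = set()

    def submit(self, plan_id):
        self.executing.add(plan_id)  # "writing the task"
        self.dn.run_plan(plan_id)

    def reconcile(self, snapshot):
        for plan_id in list(self.executing):
            self.executing.discard(plan_id)
            if plan_id in snapshot:
                self.dn.sync_segments(plan_id)
            else:
                self.failed.add(plan_id)  # treated as failed, never synced


def buggy_round(dc, plan_id):
    snapshot = dc.dn.get_results()  # results fetched first
    dc.submit(plan_id)              # task written afterwards
    dc.reconcile(snapshot)          # stale snapshot lacks plan_id


def ordered_round(dc, plan_id):
    dc.submit(plan_id)              # task written first
    snapshot = dc.dn.get_results()  # snapshot now includes plan_id
    dc.reconcile(snapshot)


leaky_dn = DataNode()
buggy_round(DataCoord(leaky_dn), 1)
clean_dn = DataNode()
ordered_round(DataCoord(clean_dn), 1)
print(len(leaky_dn.results), len(clean_dn.results))
```

In the buggy ordering the result stays in `leaky_dn.results` indefinitely, which corresponds to the leak described above. `ordered_round` shows one ordering that avoids the stale snapshot in this model; it is not necessarily what the linked fix does.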

XuanYang-cn commented 1 year ago

Leak profile in DN: image

XuanYang-cn commented 1 year ago

In my test, it takes 2 GC cycles (2 hrs) to finish GC of all segments and reduce the MinIO disk usage.

The first GC removed all compactedTo segments with 0 lines. The second GC removed all original segments with TTL data. This might need to be improved later.
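
One way to picture why two cycles are needed is the toy model below. It rests on an assumed rule that is not stated in this thread: a segment is collected only if it has no live rows and no segment present at the start of the pass still lists it as a compaction source. The `Segment` type and `gc_pass` function are hypothetical and do not reflect the actual Milvus GC code.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    seg_id: int
    live_rows: int                      # rows not yet past their TTL
    compacted_from: list = field(default_factory=list)


def gc_pass(segments: dict) -> list:
    """Run one GC cycle over `segments` (seg_id -> Segment); return removed IDs.

    Assumed rule: remove a segment only if it has no live rows and no
    segment present at the start of this pass references it as a source.
    """
    referenced = {src for seg in segments.values() for src in seg.compacted_from}
    removable = [
        sid for sid, seg in segments.items()
        if seg.live_rows == 0 and sid not in referenced
    ]
    for sid in removable:
        del segments[sid]
    return sorted(removable)


# Segments 1 and 2 are originals whose rows have all expired;
# segment 3 is the compactedTo segment with 0 lines.
segments = {
    1: Segment(1, live_rows=0),
    2: Segment(2, live_rows=0),
    3: Segment(3, live_rows=0, compacted_from=[1, 2]),
}
first = gc_pass(segments)
second = gc_pass(segments)
print(first, second)
```

Under this assumed rule the first pass removes only the compactedTo segment and the second pass removes the originals, matching the order observed in the test.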

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.