milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: The DiskQuota limit is inaccurate, and exceeding the DiskQuota will result in a failure during recovery. #33775

Open lentitude2tk opened 2 months ago

lentitude2tk commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Per discussion with the kernel team, under the current strategy a user can keep writing data to a Free cluster.
  2. Data that has already been persisted is restricted by the diskQuota, but some data may still be sitting in the message queue, so the total data consumed ends up exceeding the diskQuota (a sketch of this gap follows the list).
  3. If a backup is taken at this point and then restored,
  4. the bulkInsert operation fails during restoration, reporting that the diskQuota has been exceeded and recovery is not possible.
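
A minimal Go sketch of the accounting gap described above (all names and numbers are hypothetical, not Milvus internals): the write path rejects new data only once the already-persisted size reaches diskQuota, so whatever is still sitting in the message queue at that moment is flushed afterwards and pushes the persisted total past the quota.

```go
package main

import "fmt"

const diskQuotaMB = 1000 // hypothetical quota

type cluster struct {
	persistedMB int // data already flushed to storage; all the quota check sees
	inQueueMB   int // data accepted into the message queue but not yet consumed
}

// acceptWrite models the current strategy: writes are rejected only once the
// persisted size reaches the quota; the MQ backlog is invisible to the check.
func (c *cluster) acceptWrite(sizeMB int) bool {
	if c.persistedMB >= diskQuotaMB {
		return false
	}
	c.inQueueMB += sizeMB
	return true
}

// flushTick moves at most 60 MB from the queue into storage, simulating the data
// node consuming the MQ more slowly than the proxy accepts writes.
func (c *cluster) flushTick() {
	n := c.inQueueMB
	if n > 60 {
		n = 60
	}
	c.persistedMB += n
	c.inQueueMB -= n
}

func main() {
	c := &cluster{}
	for c.acceptWrite(100) { // the user keeps writing 100 MB batches
		c.flushTick()
	}
	backlog := c.inQueueMB
	for c.inQueueMB > 0 { // the backlog is still consumed after writes are rejected
		c.flushTick()
	}
	fmt.Printf("quota=%d MB, backlog at rejection=%d MB, final persisted=%d MB\n",
		diskQuotaMB, backlog, c.persistedMB)
	// The final persisted size exceeds diskQuota, so a backup taken now trips the
	// same quota check when restored through bulkInsert.
}
```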

Expected Behavior

If the user's data can be written normally, the backup and subsequent recovery should also work properly. Here are some potential solutions for reference, with the specifics open to kernel discussion:

  1. When the data reaches the diskQuota, discard the remaining data in the consumption queue without processing it.
  2. When the data reaches a certain proportion of the diskQuota (e.g., reserving 100 MB or 5% as headroom), start enforcing the diskQuota limit so that the data still in the message queue cannot push usage past the diskQuota (see the sketch after this list).
  3. Do not restrict the size of the bulkInsert binlog. That way, even after recovery, the user's data will only have reached the diskQuota, and the impact will be manageable.
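
Option 2 could look roughly like the following Go sketch (hypothetical names and numbers, not the actual Milvus quota code): writes are admitted only while persisted data plus the unconsumed message-queue backlog stays under the diskQuota minus a reserve of max(100 MB, 5%).

```go
package main

import "fmt"

const diskQuotaMB = 4096 // hypothetical quota

// reserveMB is the headroom kept below the quota: the larger of 100 MB or 5%.
func reserveMB(quotaMB int) int {
	r := quotaMB * 5 / 100
	if r < 100 {
		r = 100
	}
	return r
}

// allowWrite admits a write only if persisted data plus the MQ backlog plus the
// new request still fits under quota minus reserve, so draining the queue can
// never push usage past the configured diskQuota.
func allowWrite(persistedMB, inQueueMB, requestMB int) bool {
	limit := diskQuotaMB - reserveMB(diskQuotaMB)
	return persistedMB+inQueueMB+requestMB <= limit
}

func main() {
	fmt.Println(allowWrite(3500, 200, 100)) // true:  3800 MB <= 3892 MB
	fmt.Println(allowWrite(3700, 200, 100)) // false: 4000 MB >  3892 MB
}
```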

Steps To Reproduce

1. The issue can be reproduced consistently by following steps 1-4 listed under Current Behavior above.

Milvus Log

https://grafana.op.zillizcloud.com/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22milvus-gcp-us-west1-1%5C%22,namespace%3D%5C%22milvus-in01-631a262f0093e3a%5C%22,pod%3D~%5C%22in01-631a262f0093e3a-milvus-.*%5C%22%7D%7C%3D%5C%22450140497833660850%5C%22%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

No response

bigsheeper commented 2 months ago

/assign

yanliang567 commented 2 months ago

I'd close this issue since it is a Zilliz Cloud issue.

lentitude2tk commented 1 month ago

> I'd close this issue since it is a Zilliz Cloud issue.

Why was this issue closed? This problem was discovered in the cloud, but the root cause is a series of issues caused by the kernel's inability to strictly enforce disk quota limits at the database level.

xiaofan-luan commented 1 month ago

Seems that there is no quick fix for that?

If insertion is quota-limited, then the data size is already more than a cluster with the same memory can load.

Unless we remove the check for all backup requests.

lentitude2tk commented 1 month ago

> If insertion is quota-limited, then the data size is already more than a cluster with the same memory can load.

Yes, as long as the user encounters this quota-limited error, it means that the current data size has already exceeded the normal quota, which will affect subsequent processes.

> Unless we remove the check for all backup requests.

In fact, the current handling is similar: when such errors occur, especially during resume, we can only lift the restriction so the process can continue, but today this is done manually.

A simple solution currently under consideration is to expose a parameter in the bulkInsert Option that allows skipping the validation. One benefit is that different restrictions can then be applied to stop/resume and migrate: for normal user stop/resume operations, data should be writable and the cluster should resume normally after stopping, whereas during migrateFrom the restriction should stay in place to prevent users from writing data far beyond the diskQuota into Milvus.
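
A rough sketch of this parameter, with hypothetical names (not the actual bulkInsert API or the option key that was eventually merged): the resume path sets a skip flag on the requests it builds, while migrateFrom leaves the flag unset so the diskQuota stays enforced.

```go
package main

import "fmt"

// Hypothetical request shape and option key, for illustration only.
type bulkInsertRequest struct {
	Files   []string
	Options map[string]string
}

// forResume builds the request used when resuming a stopped cluster from backup:
// the data was already accepted once, so the quota validation is skipped.
func forResume(files []string) *bulkInsertRequest {
	return &bulkInsertRequest{
		Files:   files,
		Options: map[string]string{"skip_disk_quota_check": "true"}, // hypothetical key
	}
}

// forMigrate builds the request used by migrateFrom: validation stays on so a
// migration cannot push the target cluster far past its diskQuota.
func forMigrate(files []string) *bulkInsertRequest {
	return &bulkInsertRequest{Files: files}
}

func main() {
	fmt.Println(forResume([]string{"backup/binlog_0"}).Options)
	fmt.Println(forMigrate([]string{"source/binlog_0"}).Options)
}
```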

xiaofan-luan commented 1 month ago

> > If insertion is quota-limited, then the data size is already more than a cluster with the same memory can load.
>
> Yes, as long as the user encounters this quota-limited error, it means that the current data size has already exceeded the normal quota, which will affect subsequent processes.
>
> > Unless we remove the check for all backup requests.
>
> In fact, the current handling is similar: when such errors occur, especially during resume, we can only lift the restriction so the process can continue, but today this is done manually.
>
> A simple solution currently under consideration is to expose a parameter in the bulkInsert Option that allows skipping the validation. One benefit is that different restrictions can then be applied to stop/resume and migrate: for normal user stop/resume operations, data should be writable and the cluster should resume normally after stopping, whereas during migrateFrom the restriction should stay in place to prevent users from writing data far beyond the diskQuota into Milvus.

Agree, we can add a config to ignore the disk size check. Comments? @bigsheeper

lentitude2tk commented 1 month ago

/open

bigsheeper commented 1 month ago

> Agree, we can add a config to ignore the disk size check. Comments? @bigsheeper

Yep, I also think so.

xiaofan-luan commented 1 month ago

We should still keep the disk quota limit enabled.

bigsheeper commented 2 weeks ago

We can add an option to the import request to skip the disk quota check, which would give us more flexibility. And we don't need to worry about cloud users bypassing the check by setting this option, as they don't have access to the bulkinsert interface.
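
A minimal sketch of the server side of that idea, again with hypothetical names and option key (a sketch under stated assumptions, not the merged implementation): the quota comparison is bypassed only when the per-request option is present, so the ordinary write and migration paths keep hitting the limit.

```go
package main

import (
	"errors"
	"fmt"
)

var errDiskQuotaExceeded = errors.New("disk quota exceeded")

// checkImportQuota is a hypothetical stand-in for the quota check run on an import
// request; the check is skipped only when the per-request option is set.
func checkImportQuota(options map[string]string, usedBytes, importBytes, quotaBytes int64) error {
	if options["skip_disk_quota_check"] == "true" { // hypothetical option key
		return nil // restore path: the data was within the cluster before the backup
	}
	if usedBytes+importBytes > quotaBytes {
		return errDiskQuotaExceeded
	}
	return nil
}

func main() {
	quota := int64(10 << 30) // 10 GiB
	// A restore whose binlogs overshoot the quota still succeeds with the option set.
	fmt.Println(checkImportQuota(map[string]string{"skip_disk_quota_check": "true"}, 9<<30, 2<<30, quota))
	// The same import without the option is rejected.
	fmt.Println(checkImportQuota(nil, 9<<30, 2<<30, quota))
}
```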

bigsheeper commented 2 weeks ago

@lentitude2tk PR merged; this enhancement will be released in version 2.4.7.

bigsheeper commented 2 weeks ago

/assign @lentitude2tk /unassign