milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Varchar field's max_length and Array field's max_capacity is not validated during bulkinsert #34150

Open · yhmo opened 3 months ago

yhmo commented 3 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.4
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Currently, through the bulkinsert interface, users can import a file containing strings whose length exceeds the VARCHAR field's max_length. This is unexpected.

Likewise, an Array field's max_capacity is not validated.

Expected Behavior

max_length/max_capacity should be validated during bulkinsert.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

nish112022 commented 3 months ago

@bigsheeper It's been 2 weeks since this issue was opened, and I know your hands are tied with some other urgent work. If there are no objections, I would like to take this up.

xiaofan-luan commented 3 months ago

/assign @nish112022

Please let me know if you need any help.

nish112022 commented 3 months ago

Thanks @xiaofan-luan, will let you know.

nish112022 commented 2 months ago

Apologies for the delay, I got caught up with some priority tasks.

I was going through the files and what I found is that:

  1. For batch insert: the validation happens in pymilvus itself, in _parse_batch_request(), which calls check_str_arr() from entity_helper.py, and we get the error "invalid input, length of string exceeds max length. length: x, max length: y" where x > y.

  2. We can't do the same thing for bulk insert because the import request is made via protobuf. While going through the code, I found readFileStat in internal/datanode/importv2/scheduler.go. I think I can make changes so that the reader validates the varchar length as it reads the data (see the sketch below), and then take up the same for the Array field. What do you think, @xiaofan-luan? Please let me know if you have a better approach in mind.
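
A minimal, self-contained sketch of what I have in mind; the function name, the rune-based length measure, and the error text are just placeholders, not the actual Milvus reader code:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkVarcharRow is a hypothetical helper: as the reader parses each row,
// compare the string's length against the field's max_length and fail fast
// with a descriptive error instead of letting the oversized value through.
func checkVarcharRow(fieldName, value string, maxLength int) error {
	if length := utf8.RuneCountInString(value); length > maxLength {
		return fmt.Errorf("field %q: string length %d exceeds max_length %d",
			fieldName, length, maxLength)
	}
	return nil
}

func main() {
	// Simulate a few parsed rows for a VARCHAR field with max_length = 8.
	for _, v := range []string{"ok", "way_too_long_value"} {
		if err := checkVarcharRow("name", v, 8); err != nil {
			fmt.Println("import rejected:", err)
		}
	}
}
```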

bigsheeper commented 2 months ago

Hi @nish112022 , Thank you for your interest in contributing to our project!

For better performance, I suggest doing the checks at the reader level (like in the parseEntity function for the JSON reader). You can add a public check function in util.go for each reader (json, parquet, numpy, binlog).
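
As a rough illustration, a shared check for the Array field's max_capacity could look something like the standalone sketch below; the names and signature are placeholders, not the actual util.go code:

```go
package main

import "fmt"

// checkArrayCapacity is a hypothetical shared helper that every reader
// (json, parquet, numpy) could call on each parsed Array value: it rejects
// arrays with more elements than the field's max_capacity.
func checkArrayCapacity(fieldName string, numElements, maxCapacity int) error {
	if numElements > maxCapacity {
		return fmt.Errorf("field %q: array length %d exceeds max_capacity %d",
			fieldName, numElements, maxCapacity)
	}
	return nil
}

func main() {
	// A row whose "tags" array holds 7 elements against max_capacity = 4.
	if err := checkArrayCapacity("tags", 7, 4); err != nil {
		fmt.Println("import rejected:", err)
	}
}
```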

If there's anything you'd like to discuss, please let me know.

nish112022 commented 2 months ago

Sure, I will start making changes and let you know if there are any doubts.

nish112022 commented 2 months ago

I have made changes for json, parquet, and numpy. I am facing a little difficulty with the binlog reader.

I am using parameterutil.GetMaxLength() to get the max length of the varchar field. For binlog, I believe I need to make changes in GetDataFromPayload() in internal/storage/payload_reader.go, but the binlog reader doesn't have the field schema (field *schemapb.FieldSchema). I might need to modify the whole binlog-reader structure to accommodate this change, or add a check in Next() in importutilv2/binlog/field_reader.go, but that would mean going through the rows again one by one. What do you think, @bigsheeper?
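
For context, the constraint itself is stored as a type param on the field schema, so the lookup a helper like GetMaxLength performs is roughly what this standalone sketch does (the types and names here are placeholders, not the actual parameterutil code):

```go
package main

import (
	"fmt"
	"strconv"
)

// keyValuePair stands in for the key/value type params attached to a field
// schema; a VARCHAR field's max_length is stored as such a param.
type keyValuePair struct {
	Key   string
	Value string
}

// getMaxLength scans the type params for "max_length" and parses its value.
func getMaxLength(typeParams []keyValuePair) (int64, error) {
	for _, kv := range typeParams {
		if kv.Key == "max_length" {
			return strconv.ParseInt(kv.Value, 10, 64)
		}
	}
	return 0, fmt.Errorf("max_length type param not found")
}

func main() {
	params := []keyValuePair{{Key: "max_length", Value: "256"}}
	maxLen, err := getMaxLength(params)
	fmt.Println(maxLen, err) // 256 <nil>
}
```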

bigsheeper commented 2 months ago

@nish112022 I think the binlog reader might not need this check since it was already done during the insertion. Thanks a lot for your contribution! I'll review your PR as soon as possible.

nish112022 commented 2 months ago

So, let's say we have a collection named "AAA". Now I create another collection named "BBB", create a partition named "kkk", and import the data of "AAA" into "BBB".

Does this mean the only use case for bulk insert with binlog is when the structure of "BBB" is identical to "AAA", and otherwise there should not be any bulk insert?

bigsheeper commented 2 months ago

Yes, that's the case for binlog import.

bigsheeper commented 1 month ago

Hello @nish112022 ~

Thank you for your pull request. I've reviewed the changes and left some minor suggestions. Could you please take a look? We look forward to seeing the updated version. Thanks again for your contribution!

Best regards.

nish112022 commented 1 month ago

Thanks for your suggestions @bigsheeper. Made the changes as per the feedback.

bigsheeper commented 1 month ago

@nish112022 Thank you for your work on the PR! We've successfully merged it into the main branch. To ensure this feature is included in our upcoming release, could you please help to cherry-pick the commit onto the 2.4 branch? This will help us prepare everything for the release. If you encounter any issues or need assistance, feel free to reach out.

nish112022 commented 1 month ago

@bigsheeper I have cherry-picked this commit and raised a new PR for the 2.4 branch.

bigsheeper commented 1 month ago

@nish112022 Your picked pull request has been successfully merged! 🎉 We're planning to include it in the upcoming v2.4.10 release. Thank you for your valuable contribution. We appreciate your effort and look forward to collaborating with you again in the future.

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.