yhmo opened this issue 3 months ago
@bigsheeper It's been 2 weeks since this issue was opened, and I know your hands are tied with other urgent work. If there are no objections, I would like to take this up.
/assign @nish112022 please let me know if you need any help
Thanks @xiaofan-luan , will let you know
Apologies for the delay, I got caught up with some priority tasks.
I was going through the files, and here is what I found:
- For batch insert, the validation happens in pymilvus itself: _parse_batch_request() calls check_str_arr() from entity_helper.py, and we get the error "invalid input, length of string exceeds max length. length: x, max length: y" where x > y.
- We can't do the same for bulk insert, because the import request is made via protobuf. While going through the code, I found readFileStat in internal/datanode/importv2/scheduler.go. I think I can change the reader so that it validates the varchar length as it reads the data, and then do the same for the Array field. What do you think @xiaofan-luan? Please let me know if you have a better approach in mind.
Hi @nish112022 , Thank you for your interest in contributing to our project!
For better performance, I suggest doing the checks at the reader level (like in the parseEntity function for the JSON reader). You can add a public check function in util.go for each reader (json, parquet, numpy, binlog).
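As a rough illustration of the kind of per-row check being suggested, here is a minimal sketch. Note that checkVarcharLength is a hypothetical name, its placement in util.go is just the suggestion above, and whether max_length counts bytes or characters is an assumption; the error text simply mirrors the pymilvus message quoted earlier.

```go
package main

import "fmt"

// checkVarcharLength is a hypothetical sketch of a public check function a
// reader could call on each varchar value. Byte length is assumed here;
// the real Milvus helper may measure length differently.
func checkVarcharLength(value string, maxLength int64) error {
	if int64(len(value)) > maxLength {
		// Error message mirrors the pymilvus check_str_arr() wording.
		return fmt.Errorf("invalid input, length of string exceeds max length. length: %d, max length: %d",
			len(value), maxLength)
	}
	return nil
}

func main() {
	fmt.Println(checkVarcharLength("hello", 8))
	fmt.Println(checkVarcharLength("toolongvalue", 4))
}
```

Each reader (json, parquet, numpy) could invoke such a function while parsing rows, so invalid data is rejected before it reaches a segment.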
If there's anything you'd like to discuss, please let me know.
Sure, I will start making changes and let you know if there are any doubts.
I have made changes for json, parquet, and numpy, but I am facing a little difficulty with the binlog reader.
I am using parameterutil.GetMaxLength() to get the max length of the varchar field. For binlog, I believe I need to change GetDataFromPayload() in internal/storage/payload_reader.go, but since the binlog reader doesn't have the field *schemapb.FieldSchema, I might need to modify the whole binlog reader structure to accommodate this change. Alternatively, I could add a check under Next() in importutilv2/binlog/field_reader.go, but that would mean going through the rows again one by one. What do you think @bigsheeper?
@nish112022 I think the binlog reader might not need this check since it was already done during the insertion. Thanks a lot for your contribution! I'll review your PR as soon as possible.
So, let's say we have a collection named "AAA". Now I create another collection named "BBB" with a partition named "kkk", and import the data of "AAA" into "BBB".
Does this mean the only use case for bulk insert with binlog is when the structure of "BBB" is identical to "AAA", and that otherwise there should be no bulk insert?
Yes, that's the case for binlog import.
Hello @nish112022 ~
Thank you for your pull request. I've reviewed the changes and left some minor suggestions. Could you please take a look? We look forward to seeing the updated version. Thanks again for your contribution!
Best regards.
Thanks for your suggestions @bigsheeper. Made the changes as per the feedback.
@nish112022 Thank you for your work on the PR! We've successfully merged it into the main branch. To ensure this feature is included in our upcoming release, could you please cherry-pick the commit onto the 2.4 branch? This will help us prepare everything for the release. If you encounter any issues or need assistance, feel free to reach out.
@bigsheeper I have cherry-picked this commit and raised a new PR for 2.4 branch.
@nish112022 Your picked pull request has been successfully merged! 🎉 We're planning to include it in the upcoming v2.4.10 release.
Thank you for your valuable contribution. We appreciate your effort and look forward to collaborating with you again in the future.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Is there an existing issue for this?
Environment
Current Behavior
Currently, via the bulkinsert interface, users can import a file containing a long string whose length exceeds max_length. This is unexpected.
Also, the Array field's max_capacity is not validated.
Expected Behavior
max_length/max_capacity should be validated during bulkinsert.
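The max_capacity side of the expected validation could take a shape like the following. This is a hedged sketch only: checkArrayCapacity is a hypothetical name, and the error wording is an assumption, not Milvus's actual message.

```go
package main

import "fmt"

// checkArrayCapacity is a hypothetical sketch of the max_capacity check
// described above: reject any array row holding more elements than the
// Array field's declared max_capacity.
func checkArrayCapacity(arrayLen int, maxCapacity int64) error {
	if int64(arrayLen) > maxCapacity {
		return fmt.Errorf("array length %d exceeds max capacity %d", arrayLen, maxCapacity)
	}
	return nil
}

func main() {
	// A row within capacity and a row exceeding a max_capacity of 4.
	rows := [][]int64{{1, 2}, {1, 2, 3, 4, 5}}
	for _, row := range rows {
		if err := checkArrayCapacity(len(row), 4); err != nil {
			fmt.Println("rejected:", err)
		} else {
			fmt.Println("ok:", row)
		}
	}
}
```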
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response