When writing to a vector database through the Milvus service, performance drops with small batch sizes or infrequent writes. This is because the service reindexes the database after each message, or after a set time has elapsed (hard coded to 3 seconds). This is inefficient for a few reasons:
When uploading infrequently, if a reindex takes ~3 seconds, you can get into a loop: add 1 message -> reindex (3 sec) -> add 1 message -> reindex (3 sec). Messages back up and the service cannot keep up.
When uploading frequently with large batch sizes, reindexing can be triggered by the number of rows. This causes the same problem, because building the index can take longer than it takes to generate the next batch of data.
Ideally, we would use something similar to a debounce to update the index, so that reindexing only occurs after some set time during which no messages have been added.
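To make the debounce idea concrete, here is a minimal sketch using only the Python standard library. It is not how the Milvus service is implemented today: `DebouncedReindexer`, `notify`, `reindex_fn`, and `quiet_period` are hypothetical names, and `reindex_fn` stands in for whatever call actually rebuilds the index.

```python
import threading
import time


class DebouncedReindexer:
    """Hypothetical helper: runs `reindex_fn` once, `quiet_period` seconds
    after the most recent call to notify()."""

    def __init__(self, reindex_fn, quiet_period=3.0):
        self._reindex_fn = reindex_fn
        self._quiet_period = quiet_period
        self._lock = threading.Lock()
        self._timer = None

    def notify(self):
        """Call after every insert; restarts the quiet-period countdown."""
        with self._lock:
            # Cancel the pending countdown (a no-op if it has already fired).
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._quiet_period, self._reindex_fn)
            self._timer.daemon = True
            self._timer.start()


# Small demonstration with a stand-in reindex function that records each call.
calls = []
reindexer = DebouncedReindexer(lambda: calls.append(time.monotonic()), quiet_period=0.3)

for _ in range(5):
    reindexer.notify()  # simulate five inserts in quick succession
    time.sleep(0.01)

time.sleep(1.0)  # wait well past the quiet period
print(f"reindex ran {len(calls)} time(s)")
```

With this approach a burst of inserts results in a single reindex once the writes stop. A real implementation would probably also want an upper bound on how long reindexing can be postponed, so that a continuous stream of writes does not starve the index indefinitely.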
Minimum reproducible example
import random
import time
import cudf
from morpheus.service.vdb.milvus_vector_db_service import MilvusVectorDBService

# milvus_server_uri and collection_name must be defined for your deployment
milvus_service = MilvusVectorDBService(uri=milvus_server_uri)

# Create the collection
...

# Make a small dataframe with 5 rows
num_input_rows = 5
df = cudf.DataFrame({
    "id": list(range(num_input_rows)),
    "age": [random.randint(20, 40) for _ in range(num_input_rows)],
    "embedding": [[random.random() for _ in range(3)] for _ in range(num_input_rows)],
})

# Add the rows to the collection in a loop
for _ in range(10000):
    milvus_service.insert_dataframe(collection_name, df)
    # Sleep some amount to allow the data to be inserted (this may need to be tweaked to trigger the bug)
    time.sleep(0.1)
Code of Conduct
[X] I agree to follow Morpheus' Code of Conduct
[X] I have searched the open bugs and have found no duplicates for this bug report
Version
24.03
Which installation method(s) does this occur on?
Docker, Conda, Source