[Feature]: List the unique values of the partition key field

ashkrisk commented 2 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

Is your feature request related to a problem? Please describe.

We'd like to be able to implement partition-key-based multi-tenancy as suggested in this document: https://milvus.io/docs/multi_tenancy.md. One of the requirements is to keep track of the number of rows used by any given tenant, for auditing purposes and to estimate the resources used per tenant.

One way to do this in Milvus is to run a count(*) query, with a filter on the partition key field:

collection.query(expr='tenant in ["my_tenant"]', output_fields=['count(*)'])

However, this method is only applicable if all the unique values in the partition key field are known beforehand (or stored in an external database).
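To make the limitation concrete, here is a minimal sketch of per-tenant counting when the tenant list has to come from outside Milvus. The helper names `tenant_filter` and `count_rows_per_tenant` are hypothetical, `collection` is assumed to be a pymilvus `Collection` whose partition key field is named `tenant`, and the count result is assumed to come back as `[{"count(*)": n}]`. Quoting the tenant name with `json.dumps` is also an assumption about what the filter expression accepts.

```python
import json


def tenant_filter(tenant: str) -> str:
    """Build the filter expression used in the query above for one tenant."""
    # json.dumps adds the double quotes and escapes embedded quotes;
    # that Milvus accepts this escaping is an assumption of this sketch.
    return f"tenant in [{json.dumps(tenant)}]"


def count_rows_per_tenant(collection, known_tenants):
    """Run one count(*) query per tenant and return {tenant: row_count}.

    `known_tenants` must be supplied by the caller, e.g. from an external
    database; Milvus itself is never asked which tenants exist.
    """
    counts = {}
    for tenant in known_tenants:
        res = collection.query(
            expr=tenant_filter(tenant),
            output_fields=["count(*)"],
        )
        # Assumed result shape: a single row holding the count.
        counts[tenant] = res[0]["count(*)"]
    return counts
```

Any tenant missing from `known_tenants` is silently left out of the result, which is exactly the gap this feature request is about.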

It would be a significantly better user experience if there was a way to list the unique values of the partition key directly from Milvus itself, and avoid the need to be in sync with an external database.

Describe the solution you'd like.

There should be a Milvus API (or a modification to an existing API) that allows one to list the unique values of the partition key field. Even better would be to generalize the unique aggregation function so that it can be used with any field, not just the partition key.

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

xiaofan-luan commented 2 months ago

seems to be a valid requirement.

/assign @congqixia might be able to help on it

ashkrisk commented 2 months ago

@xiaofan-luan @congqixia if no one has started looking into this, can I pick this up?

xiaofan-luan commented 2 months ago

> @xiaofan-luan @congqixia if no one has started looking into this, can I pick this up?

Sure, please. This seems to be a difficult one; maybe we can set up a meeting between congqi and you to start.

ashkrisk commented 2 months ago

Sounds good, thanks! I've been going through the relevant code, would be great to have a meeting in 2-3 days.

/assign @ashkrisk

congqixia commented 2 months ago

I am glad to help /assign @congqixia

ashkrisk commented 1 month ago

I've taken some time to go through the Query code path in Milvus and I have a decent idea of how things work. I think we can create a general interface for aggregation functions: in this case the distinct() function, but later on it could be extended to min(), max(), etc.

Here's a rough outline of the changes I plan to make:

In proxy:
- queryTask.PreExecute currently checks the output_fields parameter of the query request and decides to create either a "count" plan or an ordinary retrieve plan. Now it will generate plans with aggregation functions as well: QueryPlanNode and RetrieveRequest will be modified to add a string field which stores the name of the aggregation function.
- A validation function will be called to make sure the input data type is supported by the aggregation function.
- A new reducer will be added for each aggregation function.

In querynode:
- A new implementation of internalReducer and segCoreReducer for each aggregation function.

In segcore:
- RetrievePlanNode needs to be modified to include a reference to the aggregation function. This could be either a string or a base class pointer.
- The actual execution logic needs to be changed in SegmentInternalInterface::Retrieve and ExecPlanNodeVisitor. I still need to decide the best way to achieve this in a modular fashion.
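As a rough illustration of the proxy-side decision described in this outline, the sketch below models it in Python (the real code would be Go). Everything in it is hypothetical: the `distinct(field)` syntax in output_fields, the `SUPPORTED_TYPES` table, the type names, and the `plan_for` helper are assumptions made for the example, not existing Milvus behaviour.

```python
import re

# Hypothetical table: aggregation name -> field types it accepts.
SUPPORTED_TYPES = {
    "distinct": {"Int64", "VarChar"},
    "min": {"Int64", "Float", "Double"},
    "max": {"Int64", "Float", "Double"},
}

# Matches a single call such as "distinct(tenant)".
_AGG_CALL = re.compile(r"^\s*(\w+)\s*\(\s*(\w+)\s*\)\s*$")


def plan_for(output_fields, schema):
    """Return (plan_kind, aggregation_name, field_name).

    `schema` maps field names to type names. The aggregation name is the
    string that would be stored on the plan node and retrieve request.
    """
    if len(output_fields) == 1:
        text = output_fields[0]
        if text.replace(" ", "").lower() == "count(*)":
            return ("count", None, None)
        match = _AGG_CALL.match(text)
        if match and match.group(1).lower() in SUPPORTED_TYPES:
            name, field = match.group(1).lower(), match.group(2)
            # Validation step: the field must exist and have a supported type.
            if field not in schema:
                raise ValueError(f"unknown field: {field}")
            if schema[field] not in SUPPORTED_TYPES[name]:
                raise ValueError(f"{name}() does not support {schema[field]}")
            return ("aggregate", name, field)
    # Anything else is an ordinary retrieve plan.
    return ("retrieve", None, None)
```

Storing only the function name keeps the plan serializable; each later stage would then look up its own reducer by that name.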

@congqixia does this sound about right? I've emailed you to set up a meeting and discuss this further.

xiaofan-luan commented 1 month ago

Amazing, I think you've got most of the idea. To implement this function, you need to think about:

- iterating over the partition key field and doing a group-by for each segment
- doing a gather at the querynode level (segCoreReducer, cpp code)
- doing a gather on the delegator (Golang code)
- finally, based on whether the user wants to list the results or count them, the proxy has to do another gather and get the final result
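The multi-level gather above can be modelled in a few lines of Python. This is only a toy sketch of the data flow, not Milvus code: segments are plain lists of dicts, the function names are hypothetical, "count" is assumed to mean the number of distinct values, and details such as deleted rows or growing segments are ignored.

```python
from collections import Counter


def group_by_segment(segment_rows, key_field="tenant"):
    """Per-segment step: group rows by the partition key value."""
    return Counter(row[key_field] for row in segment_rows)


def gather(partials):
    """Merge partial results; the same merge serves the querynode and delegator levels."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-value row counts together
    return total


def proxy_result(shard_results, want="list"):
    """Final gather on the proxy: list the distinct values or count them."""
    merged = gather(shard_results)
    if want == "list":
        return sorted(merged)
    if want == "count":
        return len(merged)
    raise ValueError(f"unsupported result kind: {want}")
```

A querynode would call `gather` over its segments' `group_by_segment` results, the delegator would call `gather` over querynode results, and the proxy would call `proxy_result` over delegator results. Because the merged `Counter` also keeps per-value row counts, the same flow could serve the per-tenant row count from the original request.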
