milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Feature]: Support UDF Metric Distance #23112

Open liliu-z opened 1 year ago

liliu-z commented 1 year ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

See https://github.com/milvus-io/milvus/discussions/23045

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

liliu-z commented 1 year ago

/assign @liliu-z

liliu-z commented 1 year ago

/unassign @xiaofan-luan

BennySemyonovAB commented 6 months ago

Hi, has any progress been made? Our company has the same need as described in https://github.com/milvus-io/milvus/discussions/23045

xiaofan-luan commented 6 months ago

@BennySemyonovAB

Can you explain the use case for a customized distance a little? Is the distance function known before the index is built?

RaphaelCanin commented 3 months ago

Hello there @xiaofan-luan,

I am currently facing the same problem. The use case would be models like SigLIP, where the similarity function has been trained as part of the model.

Two vectors X and Y are close when sigmoid(dot(X,Y) * t + b) is close to 1. I tried using the Inner Product metric, but the results are quite bad.

In my case, the distance function is known before I build the index. I would typically expect something like this:

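import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# t and b are the scalars learned with the model (defined elsewhere).
def custom_metric(x, y):
    return sigmoid(np.dot(x, y) * t + b)

# Proposed usage: passing a callable as metric_type is the feature being requested.
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type=custom_metric
)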

I don't know to what extent this is compatible with the Milvus code, but it would be a lifesaver.

Thanks in advance

xiaofan-luan commented 3 months ago

This does not seem hard in itself. The hard part is how to let users define the function:

  1. A UDF distance function needs to be deeply optimized with SIMD instructions, which seems to be a very demanding requirement.
  2. There are some security concerns.

Anyway, I think this is a solid use case, and @liliu-z can help with it.

liliu-z commented 3 months ago

Hi @RaphaelCanin

Thanks for this info. This has always been on our roadmap, and it would be great to hear more from the community to help us better understand the need.

A quick question: in your example, the customized metric is a monotonic function of IP, which means that comparing two distances always gives the same result under IP as under the customized metric. Can I ask how a UDF metric would help in this use case?
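
The monotonicity argument can be checked numerically. The sketch below is only an illustration: the values of `t` and `b` and the random unit vectors are made up and do not come from any real model. It ranks a small corpus by inner product and by `sigmoid(ip * t + b)`, then compares the two orderings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration (t > 0).
t, b = 10.0, -1.0

# Random unit vectors standing in for model embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
query /= np.linalg.norm(query)
corpus = rng.normal(size=(100, 8))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

ip = corpus @ query            # inner product of each corpus vector with the query
score = sigmoid(ip * t + b)    # custom score from the thread

# Best-first orderings under each measure.
rank_by_ip = np.argsort(-ip, kind="stable")
rank_by_score = np.argsort(-score, kind="stable")

same_ranking = np.array_equal(rank_by_ip, rank_by_score)
print(same_ranking)
```

Because the sigmoid is strictly increasing and `t > 0`, the two orderings should coincide, so a nearest-neighbor search on IP returns the same neighbors. Only the reported score values differ.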

RaphaelCanin commented 3 months ago

Hi @xiaofan-luan, hi @liliu-z,

Thanks for taking the time to help me. @liliu-z, you are right: I have implemented a workaround, since the distance comparison is monotonic. But that holds only because t > 0, which is specific to this particular model. With t < 0, the only workaround would be a furthest-neighbor search, which is not currently available as far as I know.

For my current use case, I can use Milvus as is for the moment. However, when I need to search for a particular range of my custom score (e.g., between 50% and 60%), I need to compute the inverse of the function above (for the same example, between 0.1070889177 and 0.11051817).

It is not hard, but it makes the code unnecessarily long.
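
For reference, the inversion described above can be sketched in a few lines. This is a minimal illustration, not code from the thread: the helper names `score_to_ip` and `ip_range_for_scores` are hypothetical, and the `t` and `b` values are placeholders rather than the parameters behind the numbers quoted above.

```python
import math

def score_from_ip(ip, t, b):
    """Custom score from the thread: sigmoid(ip * t + b)."""
    return 1.0 / (1.0 + math.exp(-(ip * t + b)))

def score_to_ip(score, t, b):
    """Invert score = sigmoid(ip * t + b) for ip.

    Requires 0 < score < 1 and t != 0.
    """
    logit = math.log(score / (1.0 - score))
    return (logit - b) / t

def ip_range_for_scores(score_lo, score_hi, t, b):
    """Map a range of custom scores to the matching range of inner products."""
    bounds = (score_to_ip(score_lo, t, b), score_to_ip(score_hi, t, b))
    # With t < 0 the mapping is decreasing, so the bounds swap; sort them.
    return min(bounds), max(bounds)

# Placeholder parameters for illustration only.
t, b = 10.0, -1.0
ip_lo, ip_hi = ip_range_for_scores(0.5, 0.6, t, b)
print(ip_lo, ip_hi)
```

The resulting bounds are what a range search on the IP metric would need in place of the score bounds.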

So I can reach my goals with Milvus's current features, but it would be simpler with a UDF metric distance.

Thanks a lot

liliu-z commented 3 months ago

Yes, this would make range-related operations easier.

For t < 0, I am not sure whether we could support it by using -IP as the metric type.
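
One way to reason about the -IP idea on the client side: since dot(-q, y) = -dot(q, y), searching with a negated query vector under plain IP ranks vectors by -IP. The sketch below checks this with brute-force NumPy only. The `t` and `b` values and the data are made up, and whether an approximate index such as HNSW keeps good recall for such queries is not something this example shows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for illustration only; note t < 0.
t, b = -10.0, 1.0

# Random unit vectors standing in for model embeddings.
rng = np.random.default_rng(1)
query = rng.normal(size=8)
query /= np.linalg.norm(query)
corpus = rng.normal(size=(100, 8))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Custom score: with t < 0 it decreases as the inner product grows.
score = sigmoid((corpus @ query) * t + b)

# Plain inner product, but computed against the negated query.
ip_negated_query = corpus @ (-query)

k = 5
top_by_score = np.argsort(-score, kind="stable")[:k]
top_by_neg_ip = np.argsort(-ip_negated_query, kind="stable")[:k]

same_top_k = np.array_equal(top_by_score, top_by_neg_ip)
print(same_top_k)
```

In an exact search, the top results under the custom score with `t < 0` match the top results of an IP search issued with the negated query.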

Anyway, thanks for sharing this use case! We keep looking for more cases to help us define this UDF feature better.