lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
879 stars 105 forks source link

self implemented distance metric #200

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Dear Pynndescent team,

I am wondering whether there could be an option for this library to provide an API to allow external distance metric so that some unique distance metric can be used., e.g., for biological applications, distance computation was actually based on MinHash of genomic profiles, none of the provided metric is fast in this case.

Thanks,

Jianshu

lmcinnes commented 2 years ago

You can provide a custom metric if you wish as long as it can be jit compiled by numba. Just specify the function as the metric instead of a metric name as a string. So for example you might have

@numba.njit()
def custom_distance_function(a, b):
    ...

index = NNDescent(metric=custom_distance_function)

The biggest hurdle here is that numba supports a (relatively large) subset of python, but can't see inside other functions, so you will need your own implementation of the distance metric, you can't just call another library (unless they have jit compiled code, or it is a C library and you use CFFI in numba). Note that a custom metric may not be as fast as many of the pre-provided metrics since the latter have had a lot of optimization work go into them.

Lastly, if you are interested in using minhash, perhaps you could consider, instead, using jaccard on a sparse matrix representation of your data -- since minhash is really just an approximation of jaccard anyway, why not use the real thing? As long as you convert the data to scipy.sparse format it will be relatively fast, and may even be competitive with naive minhash implementations.

jianshu93 commented 2 years ago

hello Leland,

Thanks for the suggestion on MinHash, the problem for me is, the matrix representation of my data (dimension) is huge, 10^6 or above (and there are 10^6 of such data vector ), so to have fast approximation of Jaccard index, there are no choice for me right now but MinHash approximation. I understand the jit complied part and the MinHash software I am relying on is actually a C++ library. So it may be difficult to incorporate it or supported by jit since I am not family with JIT at all.

I am thinking kgraph C++ library here(https://github.com/aaalgo/kgraph) since Jaccard is a metric.

Thanks for your help.

Jianshu

lmcinnes commented 2 years ago

The ability to compute jaccard using sparse structure is dependent on the number of non-zeros per row, not on the actual dimension. So 10^6 dimensions might be fine if most of the data are zeros. This is very often the case for data with that many dimensions.

If you still need to use minhash then as long as you can import the function via ctypes then you should be able to use it within a numba jitted function -- you'll just have to be careful with what you pass it.