nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

User-defined distance functions for use in Python and retrieving distance records during item insertion #256

Open lxfhfut opened 3 years ago

lxfhfut commented 3 years ago

Hi, Thanks for the great work! I have two questions:

  1. I would like to define my own distance function for queries in Python. The documentation says "Can work with custom user-defined distances (C++)", so are user-defined distances only supported in C++? I created a new header file, e.g. space_customized.h, in the 'hnswlib' folder and re-installed the library using the bindings installation as instructed on the webpage, but it did not work as expected. Is there any workaround or example for defining user-defined distance functions for use in Python?

  2. Is there any way to get the distances computed during item insertion, rather than calling the knn_query function after the items have been added? In some scenarios we are less interested in the distances between new items and the items stored in the Index than in the distances between the items that have already been stored in the Index. For instance, in the Python implementation that is part of the clustering code by Matteo Dell'Amico ([https://github.com/matteodellamico/flexible-clustering]), it is easy to retrieve the distance records resulting from item insertion.

Any help on the above two questions will be greatly appreciated! Thanks.
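
To make question 2 concrete, here is a minimal sketch in plain Python of an index that takes an arbitrary Python callable as its distance and keeps every distance it computes while inserting. It is a brute-force illustration, not hnswlib code: the `RecordingIndex` class, its `add_item` method, and the `manhattan` function are hypothetical names invented for this example, and a real HNSW index would compute (and could record) only the subset of distances visited during graph construction.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


class RecordingIndex:
    """Brute-force index that records distances computed during insertion.

    Hypothetical illustration only; this class is not part of hnswlib.
    """

    def __init__(self, distance: Callable[[Vector, Vector], float]) -> None:
        self.distance = distance
        self.items: List[Vector] = []
        # (new_item_id, existing_item_id, distance) triples, in insertion order.
        self.records: List[Tuple[int, int, float]] = []

    def add_item(self, vector: Vector) -> int:
        """Insert a vector, recording its distance to every stored item."""
        new_id = len(self.items)
        for other_id, other in enumerate(self.items):
            self.records.append((new_id, other_id, self.distance(vector, other)))
        self.items.append(vector)
        return new_id


def manhattan(a: Vector, b: Vector) -> float:
    """Example user-defined distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


index = RecordingIndex(manhattan)
for point in [(0.0, 0.0), (1.0, 2.0), (4.0, 0.0)]:
    index.add_item(point)

for record in index.records:
    print(record)
```

After the loop, `index.records` holds the pairwise distances between stored items without any call to a query function.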

yurymalkov commented 3 years ago

Hi @lxfhfut,

  1. I think it is possible to define the distance in Python code (there are probably libraries for that), but I do not know a good way to do it. I am not sure what happened with your installation; adding distances should work. After making a new header you would also need to add logic to the Python bindings to create the corresponding space. As a quick and dirty hack, you can change the l2 distance to be your metric and use the "l2" argument in the distance function.
  2. That is possible to implement; it requires a few changes in the Python bindings.

lxfhfut commented 3 years ago

Hi @yurymalkov, thank you so much for the help.

  1. I created a 'space_l3.h' header file by copying 'space_l2.h' file and made corresponding modifications in 'space_l3.h' and 'python_bindings/bindings.cpp'. Now I am able to call 'hnswlib.Index(space='l3', dim=512)', but the outcome of the distance calculation was not as expected. I will look into the implementation details of 'space_l3.h'.
  2. I will look into 'python_bindings/bindings.cpp' to see if I can return the distance records in the 'addItems' function.
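
One way to debug a custom space like this is to compare the values the index returns against a plain Python reference. The sketch below assumes that 'l3' means a Minkowski distance with p = 3, which the thread does not state; the `minkowski` function and its `take_root` flag are hypothetical names for this example. Note that, according to the hnswlib README, the built-in 'l2' space returns the squared Euclidean distance (no square root), so a header copied from space_l2.h may likewise return the un-rooted sum unless that is changed.

```python
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float],
              p: float = 3.0, take_root: bool = True) -> float:
    """Reference Minkowski distance of order p.

    With take_root=False the un-rooted sum is returned, mirroring how
    hnswlib's 'l2' space reports squared L2 rather than true L2.
    """
    total = sum(abs(x - y) ** p for x, y in zip(a, b))
    return total ** (1.0 / p) if take_root else total


a = [0.0, 0.0, 0.0]
b = [1.0, 2.0, 2.0]

print(minkowski(a, b, p=2.0, take_root=False))  # squared L2
print(minkowski(a, b, p=2.0))                   # true L2
print(minkowski(a, b, p=3.0, take_root=False))  # un-rooted sum for p = 3
print(minkowski(a, b, p=3.0))                   # rooted value for p = 3
```

If the custom space's output matches the un-rooted value rather than the rooted one, the implementation may be correct and only the expectation differs.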

yurymalkov commented 3 years ago

Got it! Yes, it should work as you've described in 1. For 2. you would also need to alter the return value of addPoint at https://github.com/nmslib/hnswlib/blob/develop/hnswlib/hnswalg.h#L1084 . You would probably want to return top_candidates (which is ) when level==0. Note that addPoint returns tableint, but that value is not actually used anywhere, so it is safe to just change the return type and call it directly with the full set of arguments.

lxfhfut commented 3 years ago

Thank you for the information. Much appreciated!

matteodellamico commented 3 years ago

Hey! It's great to see that there's interest in this. I'm indeed interested in accelerating my clustering code and I was thinking about strategies to do it, with the caveat that I want to be able to use a Python function as the distance. @lxfhfut are you working on it? Maybe we could join forces :)

lxfhfut commented 3 years ago

Hi @matteodellamico, yes, I am still working on it. I found that adding a self-defined distance function in C++ is quite straightforward. You just need to 1) create a new header file by copying the template of space_l2.h, 2) implement the self-defined functions in the newly created header file, 3) include the header file in 'hnswlib.h', and 4) add the new space name in the constructor of Index in 'bindings.cpp'.

Then you should be able to call the new distance function in python.

As mentioned by @yurymalkov, it is possible to define the distance in Python code, but I don't know a good way to do so either. It would probably be easier to implement the Python distance function in C++ and add it as a self-defined distance function by following the four steps above.

matteodellamico commented 3 years ago

Unfortunately that doesn't work for me: I'm providing a library that supports user-defined Python code :)