shmulvad / close_numerical_matches

Finds close numerical matches across two arrays
MIT License
2 stars 0 forks source link

how to add custom cosine similarity distance #1

Closed ttsesm closed 1 year ago

ttsesm commented 1 year ago

Hi @shmulvad,

I am interested to use a custom distance with your lib, and more specifically the cosine similarity distance https://dataaspirant.com/five-most-popular-similarity-measures-implementation-in-python/ but I am not sure how to implement it in a similar way as you are doing here with the manhatan distance.

Could you please give a hint or something.

Thanks.

shmulvad commented 1 year ago

Hi @ttsesm,

It makes sense that you have some trouble implementing that. Due to the way the library is set up it would be impossible. Here is the main code where the supplied distance function is used (specifically L90):

https://github.com/shmulvad/close_numerical_matches/blob/36270777772a06398cdd8fb85b4a09a9af6eaeb2/close_numerical_matches/find_matches.py#L85-L93

As you can see, the custom function takes an array of the coordinate-wise differences as input. Critical information needed to compute cosine distance is lost when doing this subtraction.

Cosine distance is really a metric for vectors and not coordinates. It would be possible to adapt the code to handle taking a distance function with the required signature. But keep in mind that the vectors [1, 1]^T and [100, 100]^T also have a distance of 0, but if they are just considered as coordinates, the distance is quite big. So for this library to work with cosine distance, we also need to normalize all input coordinates to unit length before assigning them to buckets.

I do see the benefit of being able to use cosine distance though. I might take a stab at supporting something like this over the weekend.

ttsesm commented 1 year ago

Hi @shmulvad, thanks a lot for the feedback. I see what you mean. That would be nice, since for my problem I am trying to figure out which distance would give me the best result and so far I am stack to wrong results.

To briefly give you an understanding of my problem. I have two 2d numpy matrices of different dimensions. More specifically of 202736x3 let's call it source array and one of 216000x3 which we can call it the target array now these matrices contain numerical values which represent rgb color values per row. Now for the target array I also have a third array of the same size which represents direction vector. So not I wanted to find the best matches between the source and target arrays so that the corresponding directions vectors from the third array are more or less pointing to similar direction. For example these are my source and target matrices values plotted as distributions in the 3d space:

ezgif com-optimize data1.zip

shmulvad commented 1 year ago
  1. If you are comparing RGB values to RGB values, I don't think cosine distance is suitable. An almost fully black colour of [1, 1, 1] would be considered the same as completely white [255, 255, 255], and complete black [0, 0, 0] would not be well-defined.
  2. If you are comparing one array of RGB values to one array of direction vectors, I don't think any distance metric is suitable. The comparison does not really make sense.
  3. If you are comparing one array of directions to another array of directions, then cosine distance is indeed suitable. However, from your description, it sounds like there is only a single array of direction vectors. Are you looking to find the pairwise matches in this single array?

Finally, if this is just a one-off thing for you to compute, it might be easier to simply use sklearn.metrics.pairwise.cosine_similarity and take the wait you have to incur once rather than depend on this library. But who knows, depending on how many vectors you have pointing in the same direction, it might be much faster to use that library.

shmulvad commented 1 year ago

I tried looking at your data. I assume you want to find the pairwise close matches in drays? In that case, be aware that most pairs are fairly close. We can also see this from the plot you shared above. I tried to plot the approximate distribution (based on sampling a small subset) of cosine distances, and this is what I get:

drays = ... # Read from your supplied data
n_vecs = 2000
sampled_indices = np.random.choice(len(drays), n_vecs, replace=False)
sampled_drays = drays[sampled_indices]
sim_mat = cosine_similarity(sampled_drays)
sims = sim_mat[np.triu_indices(n_vecs, k=1)]
dists = 1.0 - sims
sns.histplot(dists, stat='percent')
plt.xlabel('Cosine distance')

ddcaaad4-ac30-46fd-af84-428199c597ac

As you can see, you actually have a lot similar vectors. Let's say you want the ones with a distance of less than 0.0001. That makes up about 0.0095% of your dataset. This would give you roughly 4.15M pairs (assuming the sampled results holds for the dataset as a whole).

n = len(drays)
n_pairs = n**2 - n  # ignoring diagonal
n_close_pairs = 0.00009454727363 * n_pairs # ~4,411,177

Be aware that due to cosine distance being a distance metric for vectors and not points, and this library really being for points, you need to set the bucket tolerance multiplier extremely high. I need to set the bucket_tol_mult=200 before I stop getting additional pairs when setting it higher.

I have modified the code to be able to take other distance functions (and also support cosine distance natively). The new version is available as v0.2.0 (pip install close-numerical-matches==0.2.0). The following line of code takes about 15 seconds to execute on my computer and finds 4,324,534 pairs:

matches = find_matches(drays, drays, tol=0.0001, dist='cos', bucket_tol_mult=200)
len(matches) # 4,324,534 

All the same indices (e.g., [0, 0], [1, 1], etc.) will be found here, which you might want to filter out. In that case, your code would become something like

matches_raw = find_matches(drays, drays, tol=0.0001, dist='cos', bucket_tol_mult=200)
same_index = matches_raw[:, 0] == matches_raw[:, 1]
matches = matches_raw[~same_index]
len(matches)  # 4,108,534

Check it here: https://colab.research.google.com/drive/1ThVe1cFsPPmtlRB2KR5wRmwBgb-5EFDR?usp=sharing

I think that now the library should either be able to handle your problem or alternatively your problem is out of scope for what this library tries to solve. As such, I will consider this issue closed.

ttsesm commented 1 year ago

@shmulvad thanks for the feedback and the update.