rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0
764 stars 194 forks source link

[QST] Discrepancy between results from using L2Unexpanded and L2Expanded distance metrics #731

Open tarang-jain opened 2 years ago

tarang-jain commented 2 years ago

What is your question? @mdoijade I was working on #4779 to introduce more distance metrics in TSNE and UMAP in cuml, in addition to the euclidean distance. I noticed that particularly for the sqeuclidean distance metric in UMAP, when I use raft::distance::DistanceType::L2Expanded for the brute_force_knn, I do not get a good trustworthiness score (<0.85) with a sklearn make_blobs dataset consisting of 1000 rows. However, when I use raft::distance::DistanceType::L2Unxpanded, the trustworthiness scores are more in line with those for umap.UMAP (>0.93) for the same dataset. Is there any reason for this?

mdoijade commented 2 years ago

I don't see any reason why it should happen both L2Expanded and L2Unexpanded should generate nearly similar results. what is the number of neighbors k value in this case? perhaps you may file a bug for this with repro test case ? and I can help to debug the issue.

tarang-jain commented 2 years ago

The value of n_neighbors is 10.

Here is the test snippet:

from sklearn import datasets
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness
import umap

data, labels = datasets.make_blobs(n_samples = 1000, n_features = 64, centers = 5)

for metric in ['sqeuclidean']:
    if metric == 'jaccard':
        data = data >= 0

    cuml_model = cuUMAP(n_neighbors=10, min_dist=0.01, metric=metric, init='random')

    umap_model = umap.UMAP(n_neighbors=10, min_dist=0.01, metric=metric, init='random')
    cuml_embedding = cuml_model.fit_transform(data)
    umap_embedding = umap_model.fit_transform(data)
    cu_trust = trustworthiness(data, cuml_embedding, n_neighbors=10, metric=metric)
    umap_trust = trustworthiness(data, umap_embedding, n_neighbors=10, metric=metric)
    print("metric:", metric, "cuml_trust:", cu_trust, "umap_trust:", umap_trust)

Here are the results:

  1. The case when we use L2Expanded for sqeuclidean metric: sqeuclidean cuml_trust: 0.8431028948704926 umap_trust: 0.9320104621635348

  2. When I use L2Unexpanded for sqeuclidean metric: sqeuclidean cuml_trust: 0.9329909598781108 umap_trust: 0.9307451498222448

Within my commits, the exact place where this metric is being changed is L440 in umap.pyx in my commits here.

mdoijade commented 2 years ago

Thanks for providing repro steps, I've a cuml pytest build working now, will sync your branch and try to repro soon.

mdoijade commented 2 years ago

@tarang-jain there seems to be some bug in your umap implementation, even if I specify DistanceType.L2Expanded on L440 in umap.pyx still L2Unexpanded metric is been passed to brute_force_knn RAFT API. Along with this the additional issue is it somehow passes that the Queries are column major & database/index is row major. I think you need to debug this issue in umap side to make sure correct distance metric is getting passed to knn call along with correct layouts.

tarang-jain commented 2 years ago

Alright, thanks for the reply. However, I did not modify any calls to brute_force_knn. What I have only done is add additional distance metrics. I will still look deeper into this.

tarang-jain commented 2 years ago

@mdoijade I had already updated the metric in the call to raft::spatial::knn::brute_force_knn here. So I believe that the metric is being passed correctly. About the Queries being column major and database/index being row major: I did not modify any of the existing parameters in the call to raft::spatial::knn::brute_force_knn and all the other UMAP pytests are still passing. So I am not sure what is wrong here.

mdoijade commented 2 years ago

@tarang-jain Looks the issue is resolved by using ML::brute_force_knn the cuML wrapper of the raft version, I guess it is passing stale arguments to raft::spatial::knn::brute_force_knn if used directly from algo.cuh. please apply this patch and check your issue should be resolved. We need to check with @cjnolet why incorrect arguments getting passed when directly using raft::spatial::knn::brute_force_knn whereas while using ML::brute_force_knn wrapper it is correctly passing. umap_fix.zip

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.