tarang-jain opened this issue 2 years ago
I don't see any reason why this should happen; both L2Expanded and L2Unexpanded should generate nearly identical results. What is the number of neighbors (k) in this case? Perhaps you could file a bug for this with a repro test case, and I can help debug the issue.
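For context, here is a minimal NumPy sketch (illustrative only, not the cuML/RAFT code) of why the two forms should agree: the expanded form computes ||x||^2 + ||y||^2 - 2 x.y, the unexpanded form sums the squared coordinate differences, and the two are algebraically identical, so the kNN neighbor sets should match up to floating-point rounding:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64)).astype(np.float32)

# Unexpanded form: sum of squared coordinate differences.
d_unexpanded = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Expanded form: ||x||^2 + ||y||^2 - 2 x.y (the GEMM-friendly formulation).
sq_norms = (X ** 2).sum(axis=-1)
d_expanded = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)

# Identical algebraically; only small float32 rounding differences remain.
print("max abs difference:", np.abs(d_unexpanded - d_expanded).max())
print("allclose (loose float32 tolerance):", np.allclose(d_unexpanded, d_expanded, atol=1e-2))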
The value of n_neighbors is 10. Here is the test snippet:
from sklearn import datasets
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness
import umap

data, labels = datasets.make_blobs(n_samples=1000, n_features=64, centers=5)

for metric in ['sqeuclidean']:
    if metric == 'jaccard':
        data = data >= 0
    cuml_model = cuUMAP(n_neighbors=10, min_dist=0.01, metric=metric, init='random')
    umap_model = umap.UMAP(n_neighbors=10, min_dist=0.01, metric=metric, init='random')
    cuml_embedding = cuml_model.fit_transform(data)
    umap_embedding = umap_model.fit_transform(data)
    cu_trust = trustworthiness(data, cuml_embedding, n_neighbors=10, metric=metric)
    umap_trust = trustworthiness(data, umap_embedding, n_neighbors=10, metric=metric)
    print("metric:", metric, "cuml_trust:", cu_trust, "umap_trust:", umap_trust)
Here are the results:
The case when we use L2Expanded for sqeuclidean:
metric: sqeuclidean cuml_trust: 0.8431028948704926 umap_trust: 0.9320104621635348
The case when I use L2Unexpanded for sqeuclidean:
metric: sqeuclidean cuml_trust: 0.9329909598781108 umap_trust: 0.9307451498222448
The exact place where this metric is being changed is L440 in umap.pyx in my commits, here.
Thanks for providing repro steps. I have a cuML pytest build working now; I will sync your branch and try to repro soon.
@tarang-jain There seems to be a bug in your UMAP implementation: even if I specify DistanceType.L2Expanded on L440 in umap.pyx, the L2Unexpanded metric is still being passed to the brute_force_knn RAFT API. An additional issue is that it somehow indicates the queries are column major and the database/index is row major. I think you need to debug this on the UMAP side to make sure the correct distance metric is getting passed to the knn call, along with the correct layouts.
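One quick Python-side sanity check (a sketch only, assuming this branch exposes metric='sqeuclidean' through cuml.neighbors.NearestNeighbors) is to compare the GPU kNN graph against a CPU brute-force reference; if the wrong metric or layout reaches brute_force_knn, the neighbor sets will disagree noticeably:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors as skNN
from cuml.neighbors import NearestNeighbors as cuNN

data, _ = datasets.make_blobs(n_samples=1000, n_features=64, centers=5, random_state=42)
data = data.astype(np.float32)

# CPU brute-force reference with sqeuclidean.
_, sk_idx = skNN(n_neighbors=10, metric='sqeuclidean', algorithm='brute').fit(data).kneighbors(data)
# GPU kNN (assumed to accept 'sqeuclidean' on this branch).
_, cu_idx = cuNN(n_neighbors=10, metric='sqeuclidean').fit(data).kneighbors(data)

# Order-insensitive per-row overlap; should be close to 1.0 if the same metric is used on both sides.
overlap = np.mean([len(set(a) & set(b)) / 10.0 for a, b in zip(np.asarray(sk_idx), np.asarray(cu_idx))])
print("mean neighbor overlap:", overlap)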
Alright, thanks for the reply. However, I did not modify any calls to brute_force_knn; all I have done is add additional distance metrics. I will still look deeper into this.
@mdoijade I had already updated the metric in the call to raft::spatial::knn::brute_force_knn here, so I believe that the metric is being passed correctly. About the queries being column major and the database/index being row major: I did not modify any of the existing parameters in the call to raft::spatial::knn::brute_force_knn, and all the other UMAP pytests are still passing, so I am not sure what is wrong here.
@tarang-jain It looks like the issue is resolved by using ML::brute_force_knn, the cuML wrapper of the RAFT version; I suspect stale arguments get passed to raft::spatial::knn::brute_force_knn when it is called directly from algo.cuh. Please apply this patch and check; your issue should be resolved. We need to check with @cjnolet why incorrect arguments get passed when raft::spatial::knn::brute_force_knn is used directly, whereas the ML::brute_force_knn wrapper passes them correctly.
umap_fix.zip
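After rebuilding with the patch, one quick way to confirm the behavior is a trimmed rerun of the repro above (a sketch; the 0.9 threshold is an arbitrary loose bound, not a cuML pytest threshold):

from sklearn import datasets
from sklearn.manifold import trustworthiness
from cuml.manifold.umap import UMAP as cuUMAP

data, _ = datasets.make_blobs(n_samples=1000, n_features=64, centers=5, random_state=0)
emb = cuUMAP(n_neighbors=10, min_dist=0.01, metric='sqeuclidean', init='random').fit_transform(data)
score = trustworthiness(data, emb, n_neighbors=10, metric='sqeuclidean')
print("cuml_trust:", score)
# 0.9 is a loose bound: comfortably above the ~0.84 seen with the bug and
# below the ~0.93 expected once the correct metric reaches brute_force_knn.
assert score > 0.9, score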
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
What is your question? @mdoijade I was working on #4779 to introduce more distance metrics in TSNE and UMAP in cuML, in addition to the euclidean distance. I noticed that, particularly for the sqeuclidean distance metric in UMAP, when I use raft::distance::DistanceType::L2Expanded for the brute_force_knn, I do not get a good trustworthiness score (<0.85) with a sklearn make_blobs dataset consisting of 1000 rows. However, when I use raft::distance::DistanceType::L2Unexpanded, the trustworthiness scores are more in line with those for umap.UMAP (>0.93) for the same dataset. Is there any reason for this?