Closed Hadi-94 closed 1 year ago
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
I noticed this as well, @Hadi-94 did you find a solution?
@jobs-git would you be able to share a minimal, reproducible example that illustrates this behavior? KNN Classifier uses exact nearest neighbors (which makes this unexpected).
This behavior doesn't appear to be trivially reproducible, as shown below (using the 23.04 nightly package).
from sklearn.neighbors import KNeighborsClassifier as sk_KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import cuml

N = 10000
K = 100

X, y = make_classification(
    n_samples=N,
    n_features=K,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12, test_size=0.2)

ALGORITHMS = [
    "brute",
]
N_NEIGHBORS = [
    1,
    2,
    5,
    10,
    50,
]
METRICS = [
    "euclidean",
    "manhattan",
    "cosine",
]

for alg in ALGORITHMS:
    for n_neighbors in N_NEIGHBORS:
        for metric in METRICS:
            params = {
                "algorithm": alg,
                "n_neighbors": n_neighbors,
                "metric": metric,
            }

            # cuML
            clf = cuml.neighbors.KNeighborsClassifier(**params)
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            conf_mat_cuml = confusion_matrix(y_test, y_pred)

            # sklearn
            clf = sk_KNeighborsClassifier(**params)
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            conf_mat_skl = confusion_matrix(y_test, y_pred)

            np.testing.assert_array_equal(conf_mat_skl, conf_mat_cuml)

print("All confusion matrices match.")
All confusion matrices match.
@beckernick apparently sklearn has weights="distance", which is what I had enabled for the CPU KNN, so that was the reason sklearn performed well. With the same setting, weights="uniform", I was getting almost parity. Unfortunately, I could not test weights="distance" in cuML, as it is not implemented yet.
Feature request was already submitted so I am not creating a new issue on that, see: https://github.com/rapidsai/cuml/issues/4611
TLDR: It was the different weights setting.
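To illustrate how much the weights setting alone can change predictions, here is a small sklearn-only sketch (the data points are made up for illustration): one very close class-1 neighbor against two farther class-0 neighbors flips the vote depending on the weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-1 point very close to the query, two class-0 points farther away.
X_train = np.array([[0.0], [2.0], [2.1]])
y_train = np.array([1, 0, 0])
query = np.array([[0.1]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

# Uniform: simple majority vote over the 3 neighbors -> class 0.
print(uniform.predict(query))   # [0]
# Distance: the nearby class-1 point dominates the inverse-distance vote -> class 1.
print(distance.predict(query))  # [1]
```

With weights="uniform" the two farther class-0 neighbors win the majority vote; with weights="distance" the single near class-1 neighbor outweighs them, which is exactly the kind of gap the comparison above was seeing.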
Thanks for confirming. I'm going to close this issue as resolved.
Describe the issue
I have been comparing KNeighborsClassifier from both libraries, sklearn and cuML (Python), in my project, and I have noticed that the cuML KNeighborsClassifier shows lower results than the sklearn KNeighborsClassifier.
Steps/Code to reproduce the issue
The dataset used has 17 features, 274628 entries, and 2 classes (0 and 1). The dataset has been preprocessed as follows:
1- Changed NaN values to zeros.
2- Replaced specific features' dtype from object to float32 or int.
3- Split the dataset using train_test_split() from the sklearn library.
df.info() of the dataset (after preprocessing) is shown in the photo below.
In the comparison script:
1- The dataset has been passed through a pipeline that uses MinMaxScaler() as a normalization technique and SMOTE() as an oversampling technique to oversample the training part of the dataset.
2- Both algorithms have been tested using a function that implements StratifiedKFold() and cross_validate() to produce a more comprehensive result.
3- The parameters for both algorithms match each other.
My testing function code is shown below:
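The testing function itself was not captured in this text (it appears to have been attached as an image). A minimal sketch of what such a function could look like, assuming only scikit-learn; the SMOTE() step comes from the separate imbalanced-learn package and is noted in a comment but omitted so the sketch stays self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

def evaluate(clf, X, y, n_splits=5, seed=42):
    """Cross-validate a classifier with stratified folds and return mean scores."""
    # In the original setup, SMOTE() from imbalanced-learn would also sit in the
    # pipeline to oversample the training folds; it is omitted here so the
    # sketch depends only on scikit-learn.
    pipe = make_pipeline(MinMaxScaler(), clf)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(pipe, X, y, cv=cv,
                            scoring=["accuracy", "precision", "recall", "f1"])
    return {m: np.mean(scores[f"test_{m}"])
            for m in ["accuracy", "precision", "recall", "f1"]}

# Synthetic stand-in for the real dataset (17 features, binary target).
X, y = make_classification(n_samples=1000, n_features=17, random_state=0)
print(evaluate(KNeighborsClassifier(n_neighbors=5), X, y))
```

Swapping in cuml.neighbors.KNeighborsClassifier for the sklearn estimator would reproduce the comparison described above, keeping everything else in the function identical.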
Expected behavior
The results obtained from this simple test are as follows:
We can notice the difference in accuracy, precision, recall, and f1-score, with KNN - sklearn scoring higher. When using a confusion matrix to compare the results, we can also notice that: the True Negative instances in KNN - sklearn are higher (sklearn model --> 40798, cuML model --> 39935); the True Positive instances in KNN - sklearn are higher (sklearn model --> 8489, cuML model --> 800); the False Positive instances in KNN - sklearn are lower (sklearn model --> 2132, cuML model --> 2995); the False Negative instances in KNN - sklearn are lower (sklearn model --> 3507, cuML model --> 11196).
Knowing that both models have had the same parameters, the results should be very similar; however, that is not the case here, as there is a huge difference in accuracy, precision, recall, f1-score, and the confusion matrix analysis.
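For reference, the overall accuracies implied by the reported confusion-matrix counts can be computed directly from the numbers above (both sets of counts sum to the same 54926 test instances):

```python
# Confusion-matrix counts as reported in the issue (TN, FP, FN, TP).
sklearn_counts = dict(tn=40798, fp=2132, fn=3507, tp=8489)
cuml_counts = dict(tn=39935, fp=2995, fn=11196, tp=800)

def accuracy(c):
    """Fraction of correct predictions: (TN + TP) / total."""
    total = c["tn"] + c["fp"] + c["fn"] + c["tp"]
    return (c["tn"] + c["tp"]) / total

print(f"sklearn accuracy: {accuracy(sklearn_counts):.4f}")  # 0.8973
print(f"cuML accuracy:    {accuracy(cuml_counts):.4f}")     # 0.7416
```

This puts the reported gap at roughly 16 percentage points of accuracy, which, as the discussion above concluded, is consistent with a weighting-setting mismatch rather than an approximate-neighbors issue.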
Environment details (please complete the following information):