rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] cuML KNN Classifier gives lower results when compared to sklearn KNN Classifier #4459

Closed Hadi-94 closed 1 year ago

Hadi-94 commented 2 years ago

Describe the issue

I have been comparing KNeighborsClassifier from sklearn and cuML (Python) on my project, and I have noticed that cuML's KNeighborsClassifier scores lower than sklearn's KNeighborsClassifier.

Steps/Code to reproduce the issue

The dataset used has 17 features, 274628 entries, and 2 classes (0 and 1). The dataset has been preprocessed as follows:

1- Changed NaN values to zeros.
2- Converted specific features' dtype from object to float32 or int.
3- Split the dataset using train_test_split() from the sklearn library.

df.info() of the dataset (after preprocessing) that I'm using is shown below

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274628 entries, 0 to 274627
Data columns (total 18 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   address               274628 non-null  float32
 1   function              274628 non-null  float32
 2   length                274628 non-null  float32
 3   setpoint              274628 non-null  float32
 4   gain                  274628 non-null  float32
 5   reset rate            274628 non-null  float32
 6   deadband              274628 non-null  float32
 7   cycle time            274628 non-null  float32
 8   rate                  274628 non-null  float32
 9   system mode           274628 non-null  float32
 10  control scheme        274628 non-null  float32
 11  pump                  274628 non-null  float32
 12  solenoid              274628 non-null  float32
 13  pressure measurement  274628 non-null  float32
 14  crc rate              274628 non-null  float32
 15  command response      274628 non-null  float32
 16  time                  274628 non-null  float32
 17  binary result         274628 non-null  int64   
dtypes: float32(17), int64(1)
memory usage: 24.1+ MB
time: 41.8 ms (started: 2021-12-18 13:46:35 +00:00)

In the comparison script:

1- The dataset is passed through a pipeline that uses MinMaxScaler() as the normalization technique and SMOTE() as the oversampling technique, to oversample the training part of the dataset.
2- Both algorithms are tested using a function that applies StratifiedKFold() and cross_validate() to get a more comprehensive result.
3- The parameters for both algorithms match each other.

My testing function code is shown below:

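# Imports inferred from the names used below (the original post did not include them).
# SMOTE inside a pipeline requires imblearn's Pipeline rather than sklearn's.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier as KNNsklearn
from cuml.neighbors import KNeighborsClassifier as KNNcuml
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Function (Script) to test KNNsklearn and KNNcuml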
def run_exps(X_train: pd.DataFrame , y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> pd.DataFrame:

  # Lightweight script to test many models and find winners
  # :param X_train: training split
  # :param y_train: training target vector
  # :param X_test: test split
  # :param y_test: test target vector
  # :return: DataFrame of predictions

  dfs = []
  models = [
            # Setting up KNN - sklearn parameters to match KNN - cuML parameters
            # KNN - cuML --> there is no leaf_size setting since the only algorithm used is "brute".
            # KNN - sklearn --> metric_params is set to None by default.
            # KNN - cuML --> metric_params not available.
            ('KNN - sklearn', KNNsklearn(n_neighbors = 3, weights='uniform', algorithm='brute',  metric='euclidean')),
            ('KNN - cuML', KNNcuml(n_neighbors = 3, weights='uniform', algorithm='brute', metric='euclidean', output_type='input'))
            ]

  results = []
  names = []
  scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

  for name, model in models:

    pipe = Pipeline([
                     ('normalization', MinMaxScaler()),
                     ('oversampling', SMOTE()),
                     ('name', model)
                     ])

    kfold = StratifiedKFold(n_splits=5)
    cv_results = cross_validate(pipe, X_train, y_train, cv=kfold, scoring=scoring, verbose=4)

    clf = model.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print('''
    {}
    {}
    {}
    ''' .format(name, classification_report(y_test, y_pred), confusion_matrix(y_test, y_pred)))

    results.append(cv_results)
    names.append(name)
    this_df = pd.DataFrame(cv_results)
    this_df['model'] = name
    dfs.append(this_df)
    final = pd.concat(dfs, ignore_index=True)

  return final

# Loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")

# Filling missing values with zeros
df = df.fillna(0)

# Replace the data in command response from being objects to integers
df["command response"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
df["binary result"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)

# Change the datatype of some features to be able to be used later 
cols = df.select_dtypes(include=['float64']).columns
df[cols] = df[cols].astype('float32')
df["command response"] = pd.to_numeric(df["command response"]).astype('float32')
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)

# Extract features and Targets
X = df.iloc[:, 0:17]
y= df.iloc[:, 17]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

# Calling the Testing script
run_exps(X_train, y_train, X_test, y_test)

Expected behavior

The results obtained from this simple test are as follows:

    KNN - sklearn
                  precision    recall  f1-score   support

           0       0.92      0.95      0.94     42930
           1       0.80      0.71      0.75     11996

    accuracy                           0.90     54926
   macro avg       0.86      0.83      0.84     54926
weighted avg       0.89      0.90      0.90     54926

    [[40798  2132]
 [ 3507  8489]]

    KNN - cuML
                  precision    recall  f1-score   support

           0       0.78      0.93      0.85     42930
           1       0.21      0.07      0.10     11996

    accuracy                           0.74     54926
   macro avg       0.50      0.50      0.48     54926
weighted avg       0.66      0.74      0.69     54926

    [[39935  2995]
 [11196   800]]

We can see the difference in accuracy, precision, recall and f1-score, where KNN - sklearn scores higher. Comparing the confusion matrices, we can also see that:

1- True negatives are higher for KNN - sklearn (sklearn model --> 40798, cuML model --> 39935).
2- True positives are higher for KNN - sklearn (sklearn model --> 8489, cuML model --> 800).
3- False positives are lower for KNN - sklearn (sklearn model --> 2132, cuML model --> 2995).
4- False negatives are lower for KNN - sklearn (sklearn model --> 3507, cuML model --> 11196).

Since both models had the same parameters, the results should be very similar. However, that is not the case here: there is a huge difference in accuracy, precision, recall, f1-score and the confusion matrix analysis.
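As a quick sanity check on the numbers quoted above, the headline metrics can be recomputed directly from the two confusion matrices. This is a minimal pure-Python sketch; the helper name `binary_metrics` is hypothetical, and the matrix layout (rows = true class, columns = predicted class) is assumed to follow sklearn's `confusion_matrix` convention.

```python
# Recompute accuracy and class-1 precision/recall from the reported
# confusion matrices (rows = true class, columns = predicted class).

def binary_metrics(cm):
    """Return accuracy plus precision/recall for class 1 from a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }

# Values copied from the two reports in the issue.
cm_sklearn = [[40798, 2132], [3507, 8489]]
cm_cuml = [[39935, 2995], [11196, 800]]

metrics_sklearn = binary_metrics(cm_sklearn)
metrics_cuml = binary_metrics(cm_cuml)

for name, m in [("KNN - sklearn", metrics_sklearn), ("KNN - cuML", metrics_cuml)]:
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Rounded to two decimals, these agree with the classification reports: the gap is concentrated in class 1, where cuML recovers only 800 of 11996 positives.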

Environment details (please complete the following information):

jobs-git commented 1 year ago

I noticed this as well, @Hadi-94 did you find a solution?

beckernick commented 1 year ago

@jobs-git would you be able to share a minimal, reproducible example that illustrates this behavior? KNN Classifier uses exact nearest neighbors (which makes this unexpected).

It's not trivial to reproduce this behavior, as shown below (using the 23.04 nightly package).

from sklearn.neighbors import KNeighborsClassifier as sk_KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import cuml

N = 10000
K = 100

X, y = make_classification(
    n_samples=N,
    n_features=K
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12, test_size=0.2)

ALGORITHMS = [
    "brute",
]

N_NEIGHBORS = [
    1,
    2,
    5,
    10,
    50
]

METRICS = [
    "euclidean",
    "manhattan",
    "cosine",
]

for alg in ALGORITHMS:
    for n_neighbors in N_NEIGHBORS:
        for metric in METRICS:
            params = {
                "algorithm": alg,
                "n_neighbors": n_neighbors,
                "metric": metric,
            }
            # cuml
            clf = cuml.neighbors.KNeighborsClassifier(**params)   
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            conf_mat_cuml = confusion_matrix(y_test, y_pred)

            # sklearn
            clf = sk_KNeighborsClassifier(**params)   
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            conf_mat_skl = confusion_matrix(y_test, y_pred)
            np.testing.assert_array_equal(conf_mat_skl, conf_mat_cuml)

print("All confusion matrices match.")
All confusion matrices match.

jobs-git commented 1 year ago

@beckernick apparently, sklearn has weights="distance", which is what I had enabled for the CPU KNN, so that was the reason why sklearn performed better. With the same setting, weights="uniform", I was getting near parity. Unfortunately, I could not test weights="distance" in cuML, as it is not implemented yet.
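To illustrate why the two settings can disagree, here is a minimal pure-Python sketch of the voting step only. The function `knn_vote` and the toy distances are hypothetical and are not sklearn or cuML code; the inverse-distance rule mirrors what sklearn documents for weights="distance", and the zero-distance handling is a simplification.

```python
from collections import defaultdict

def knn_vote(distances, labels, weights="uniform"):
    """Pick a class from the k nearest neighbors' labels.

    weights="uniform": every neighbor counts as one vote.
    weights="distance": each neighbor's vote is scaled by 1 / distance.
    """
    if weights not in ("uniform", "distance"):
        raise ValueError(f"unsupported weights: {weights!r}")

    if weights == "distance" and any(d == 0 for d in distances):
        # Simplification: exact matches dominate, so only they vote.
        pairs = [(1.0, lab) for d, lab in zip(distances, labels) if d == 0]
    elif weights == "distance":
        pairs = [(1.0 / d, lab) for d, lab in zip(distances, labels)]
    else:
        pairs = [(1.0, lab) for lab in labels]

    totals = defaultdict(float)
    for weight, lab in pairs:
        totals[lab] += weight
    return max(totals, key=totals.get)

# Hypothetical query with k = 3: one very close neighbor of class 1
# and two farther neighbors of class 0.
distances = [0.1, 1.0, 1.0]
labels = [1, 0, 0]

uniform_pred = knn_vote(distances, labels, weights="uniform")
distance_pred = knn_vote(distances, labels, weights="distance")
print("uniform:", uniform_pred, "distance:", distance_pred)
```

With uniform weights the two class-0 neighbors outvote the single class-1 neighbor; with distance weights the close class-1 neighbor carries a vote of 10 against a combined 2, so the prediction flips.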

Feature request was already submitted so I am not creating a new issue on that, see: https://github.com/rapidsai/cuml/issues/4611

TLDR: It was the different weight setting.
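For anyone hitting a similar gap, one way to catch this kind of mismatch early is to diff the two estimators' settings before comparing scores. The sketch below is a hedged illustration: `diff_params` is a hypothetical helper, and the two dictionaries are made-up stand-ins for what `get_params()` returns on a sklearn estimator (cuML estimators are assumed to expose the same method).

```python
def diff_params(params_a, params_b):
    """Return {name: (value_a, value_b)} for every setting that differs."""
    not_set = "<not set>"
    diff = {}
    for name in sorted(set(params_a) | set(params_b)):
        a = params_a.get(name, not_set)
        b = params_b.get(name, not_set)
        if a != b:
            diff[name] = (a, b)
    return diff

# Hypothetical settings standing in for estimator.get_params() output.
cpu_params = {
    "n_neighbors": 3,
    "algorithm": "brute",
    "metric": "euclidean",
    "weights": "distance",
}
gpu_params = {
    "n_neighbors": 3,
    "algorithm": "brute",
    "metric": "euclidean",
    "weights": "uniform",
}

mismatches = diff_params(cpu_params, gpu_params)
print(mismatches)
```

In practice the two libraries expose different parameter sets, so some reported differences would be parameters that exist on only one side; those show up with the "<not set>" placeholder.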

beckernick commented 1 year ago

Thanks for confirming. I'm going to close this issue as resolved.