shankarpandala / lazypredict

Lazy Predict helps build a lot of basic models without much code and helps you understand which models work better without any parameter tuning
MIT License

feature not accepted - sparse matrix #328

Open jolio007 opened 3 years ago

jolio007 commented 3 years ago

Description

I have one feature. This feature is a `<1044374x15537 sparse matrix of type '<class 'numpy.float64'>' with 10514625 stored elements in Compressed Sparse Row format>`. I'm guessing that's where the problem comes from; I'm getting:

```
~\AppData\Roaming\Python\Python38\site-packages\lazypredict\Supervised.py in fit(self, X_train, X_test, y_train, y_test)
    269         X_test = pd.DataFrame(X_test)
    270
--> 271         numeric_features = X_train.select_dtypes(include=[np.number]).columns
    272         categorical_features = X_train.select_dtypes(include=["object"]).columns
    273

~\AppData\Roaming\Python\Python38\site-packages\scipy\sparse\base.py in __getattr__(self, attr)
    685             return self.getnnz()
    686         else:
--> 687             raise AttributeError(attr + " not found")
    688
    689     def transpose(self, axes=None, copy=False):

AttributeError: select_dtypes not found
```

shyamcody commented 3 years ago

I guess the issue isn't with the library here but with the input object being a sparse matrix, which can't be treated as a DataFrame and therefore throws the error. Please use `pd.DataFrame(X.toarray())` to turn your feature into a DataFrame and then pass it to the model; the problem should go away. @brendalf do you think we need to add a check on our input for the sparse matrix? It seems a bit unnecessary to me. For how to turn a sparse matrix into a `pandas.DataFrame`, refer to this StackOverflow answer.
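For reference, a minimal sketch of that conversion (assuming `X` is a `scipy.sparse` CSR matrix small enough to densify):

```python
# Densify the CSR matrix and wrap it in a DataFrame.
# Only feasible when the dense array fits in memory.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0.0, 1.0, 0.0],
                         [2.0, 0.0, 3.0]]))
X_df = pd.DataFrame(X.toarray())
print(X_df.dtypes)  # all float64 columns, so select_dtypes works again
```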

brendalf commented 3 years ago

Thank you for the help @shyamcody. I agree with you, it seems a bit unnecessary. Perhaps we can just print a message if the input isn't a numpy ndarray or a pandas DataFrame. Something like: 'you need to pass a pandas DataFrame or a numpy ndarray'. What do you think?
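A hypothetical sketch of such a check (the function name and message are illustrative, not existing lazypredict code):

```python
# Reject inputs that are neither a numpy ndarray nor a pandas DataFrame,
# as suggested above. `check_input` is an illustrative helper name.
import numpy as np
import pandas as pd

def check_input(X):
    if not isinstance(X, (np.ndarray, pd.DataFrame)):
        raise TypeError("you need to pass a pandas DataFrame or a numpy ndarray")
```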

shuchaa commented 1 year ago

Running LogisticRegression from sklearn on a scipy compressed sparse row matrix is way faster, so why should he convert his matrix to a pandas DataFrame? Example:

| Input type | Train-test split | Training |
| --- | --- | --- |
| Pandas DataFrame | 0.82 secs | 3.06 secs |
| Sparse pandas DataFrame | 17.14 secs | 36.93 secs |
| Scipy sparse matrix | 0.05 secs | 1.58 secs |

Taken from here: https://towardsdatascience.com/working-with-sparse-data-sets-in-pandas-and-sklearn-d26c1cfbe067

I am also trying to run lazypredict and I get the same error... AND I am using the scipy sparse matrix for a reason: it runs smoothly on sklearn regression models but not on lazypredict.

P.S. As NLP is in your bio, you should know that the compressed sparse row matrix is widely used in OneHotEncoding (speeding up many machine learning routines). Taken from: https://dziganto.github.io/Sparse-Matrices-For-Efficient-Machine-Learning/
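For reference, a minimal sketch (synthetic data) of the behaviour being relied on here: scikit-learn estimators such as LogisticRegression accept a scipy CSR matrix directly, with no densification:

```python
# Fit LogisticRegression on a sparse matrix as-is; no DataFrame conversion.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

X = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=1000)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)  # works on the CSR matrix directly
print(clf.score(X, y))
```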

torial commented 1 year ago

I tried using `toarray`, and got a memory error for my TF-IDF training set.

```
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 180. GiB for an array with shape (121408, 198922) and data type float64
```
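For context, the reported allocation follows directly from the dense shape (rows × columns × 8 bytes for float64):

```python
# Back-of-the-envelope check of the reported allocation size.
rows, cols = 121_408, 198_922
print(rows * cols * 8 / 2**30)  # ~180 GiB, matching the error message
```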

torial commented 1 year ago

I have a fix that worked for me to use a sparse matrix. Here's the sample code for LazyClassifier's fit method. I added a should_preprocess boolean parameter that should be set to False if you have a sparse matrix.

```python
def fit(self, X_train, X_test, y_train, y_test, should_preprocess: bool = True):
    """Fit Classification algorithms to X_train and y_train, predict and score on X_test, y_test.

    Parameters
    ----------
    X_train : array-like,
        Training vectors, where rows is the number of samples
        and columns is the number of features.
    X_test : array-like,
        Testing vectors, where rows is the number of samples
        and columns is the number of features.
    y_train : array-like,
        Training labels (one label per training sample).
    y_test : array-like,
        Testing labels (one label per testing sample).
    should_preprocess : bool,
        Indicates if preprocessing columns is needed.
        Turn this off if your matrix is sparse.
    Returns
    -------
    scores : Pandas DataFrame
        Returns metrics of all the models in a Pandas DataFrame.
    predictions : Pandas DataFrame
        Returns predictions of all the models in a Pandas DataFrame.
    """
    Accuracy = []
    B_Accuracy = []
    ROC_AUC = []
    F1 = []
    names = []
    TIME = []
    predictions = {}

    if self.custom_metric is not None:
        CUSTOM_METRIC = []

    if isinstance(X_train, np.ndarray):
        X_train = pd.DataFrame(X_train)
        X_test = pd.DataFrame(X_test)

    preprocessor = None
    if should_preprocess:
        numeric_features = X_train.select_dtypes(include=[np.number]).columns
        categorical_features = X_train.select_dtypes(include=["object"]).columns

        categorical_low, categorical_high = get_card_split(
            X_train, categorical_features
        )

        preprocessor = ColumnTransformer(
            transformers=[
                ("numeric", numeric_transformer, numeric_features),
                ("categorical_low", categorical_transformer_low, categorical_low),
                ("categorical_high", categorical_transformer_high, categorical_high),
            ]
        )

    if self.classifiers == "all":
        self.classifiers = CLASSIFIERS
    else:
        try:
            temp_list = []
            for classifier in self.classifiers:
                full_name = (classifier.__name__, classifier)
                temp_list.append(full_name)
            self.classifiers = temp_list
        except Exception as exception:
            print(exception)
            print("Invalid Classifier(s)")

    for name, model in tqdm(self.classifiers):
        start = time.time()
        try:
            steps = []
            if should_preprocess:
                steps.append(("preprocessor", preprocessor))
            if "random_state" in model().get_params().keys():
                steps.append(("classifier", model(random_state=self.random_state)))
            else:
                steps.append(("classifier", model()))
            pipe = Pipeline(steps=steps)

            pipe.fit(X_train, y_train)
            self.models[name] = pipe
            y_pred = pipe.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred, normalize=True)
            b_accuracy = balanced_accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average="weighted")
            try:
                roc_auc = roc_auc_score(y_test, y_pred)
            except Exception as exception:
                roc_auc = None
                if self.ignore_warnings is False:
                    print("ROC AUC couldn't be calculated for " + name)
                    print(exception)
            names.append(name)
            Accuracy.append(accuracy)
            B_Accuracy.append(b_accuracy)
            ROC_AUC.append(roc_auc)
            F1.append(f1)
            TIME.append(time.time() - start)
            if self.custom_metric is not None:
                custom_metric = self.custom_metric(y_test, y_pred)
                CUSTOM_METRIC.append(custom_metric)
            if self.verbose > 0:
                if self.custom_metric is not None:
                    print(
                        {
                            "Model": name,
                            "Accuracy": accuracy,
                            "Balanced Accuracy": b_accuracy,
                            "ROC AUC": roc_auc,
                            "F1 Score": f1,
                            self.custom_metric.__name__: custom_metric,
                            "Time taken": time.time() - start,
                        }
                    )
                else:
                    print(
                        {
                            "Model": name,
                            "Accuracy": accuracy,
                            "Balanced Accuracy": b_accuracy,
                            "ROC AUC": roc_auc,
                            "F1 Score": f1,
                            "Time taken": time.time() - start,
                        }
                    )
            if self.predictions:
                predictions[name] = y_pred
        except Exception as exception:
            print(f"{name} got error: {exception}")
            if self.ignore_warnings is False:
                print(name + " model failed to execute")
                print(exception)
    if self.custom_metric is None:
        scores = pd.DataFrame(
            {
                "Model": names,
                "Accuracy": Accuracy,
                "Balanced Accuracy": B_Accuracy,
                "ROC AUC": ROC_AUC,
                "F1 Score": F1,
                "Time Taken": TIME,
            }
        )
    else:
        scores = pd.DataFrame(
            {
                "Model": names,
                "Accuracy": Accuracy,
                "Balanced Accuracy": B_Accuracy,
                "ROC AUC": ROC_AUC,
                "F1 Score": F1,
                self.custom_metric.__name__: CUSTOM_METRIC,
                "Time Taken": TIME,
            }
        )
    scores = scores.sort_values(by="Balanced Accuracy", ascending=False).set_index(
        "Model"
    )

    if self.predictions:
        predictions_df = pd.DataFrame.from_dict(predictions)
    # return both frames when predictions were collected, otherwise just the scores
    return (scores, predictions_df) if self.predictions is True else scores

```
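A hypothetical usage sketch, assuming the patched `fit` above has been applied to `LazyClassifier` and using synthetic sparse data; `should_preprocess=False` skips the ColumnTransformer so the CSR matrix reaches the estimators untouched (classifiers that can't handle sparse input are simply caught by the existing try/except):

```python
# Illustrative only: should_preprocess exists only with the patch above,
# it is not part of the released lazypredict API.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier

X = sparse_random(2000, 300, density=0.02, format="csr", random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None, predictions=True)
scores, predictions = clf.fit(X_train, X_test, y_train, y_test, should_preprocess=False)
print(scores.head())
```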