mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

stacking + kmeans features produces error #741

Open pplonski opened 3 months ago

pplonski commented 3 months ago

Here is the log from training:

:28: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
'module' object is not callable
11_Xgboost_categorical_mix_GoldenFeatures accuracy 0.989817 trained in 263.21 seconds
'module' object is not callable
29_CatBoost_GoldenFeatures accuracy 0.98992 trained in 387.23 seconds
* Step kmeans_features will try to check up to 3 models
'module' object is not callable
11_Xgboost_categorical_mix_KMeansFeatures accuracy 0.989888 trained in 278.27 seconds
'module' object is not callable
29_CatBoost_KMeansFeatures accuracy 0.990045 trained in 446.58 seconds
Not enough time to perform features selection. Skip
Time needed for features selection ~ 3607.0 seconds
Please increase total_time_limit to at least (36126 seconds) to have features selection
Skip insert_random_feature because no parameters were generated.
Skip features_selection because no parameters were generated.
* Step hill_climbing_1 will try to check up to 29 models
'module' object is not callable
56_Xgboost accuracy 0.989923 trained in 310.37 seconds
'module' object is not callable
57_Xgboost accuracy 0.990025 trained in 773.99 seconds
'module' object is not callable
58_CatBoost accuracy 0.990042 trained in 352.36 seconds
'module' object is not callable
59_CatBoost accuracy 0.990013 trained in 386.31 seconds
'module' object is not callable
60_Xgboost accuracy 0.990106 trained in 218.31 seconds
* Step hill_climbing_2 will try to check up to 29 models
'module' object is not callable
61_Xgboost accuracy 0.989955 trained in 217.98 seconds
'module' object is not callable
62_Xgboost accuracy 0.989987 trained in 249.24 seconds
'module' object is not callable
63_CatBoost accuracy 0.990077 trained in 344.56 seconds
'module' object is not callable
64_CatBoost accuracy 0.990045 trained in 431.3 seconds
* Step boost_on_errors will try to check up to 1 model
'module' object is not callable
60_Xgboost_BoostOnErrors accuracy 0.990006 trained in 206.81 seconds
* Step ensemble will try to check up to 1 model
'module' object is not callable
Ensemble accuracy 0.990799 trained in 33.97 seconds
* Step stack will try to check up to 28 models
'module' object is not callable
60_Xgboost_Stacked accuracy 0.990905 trained in 144.26 seconds
'module' object is not callable
63_CatBoost_Stacked accuracy 0.990982 trained in 201.37 seconds
'module' object is not callable
4_Default_LightGBM_Stacked accuracy 0.990895 trained in 155.36 seconds
'module' object is not callable
7_Default_NeuralNetwork_Stacked accuracy 0.990664 trained in 1462.74 seconds
38_RandomForest_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 69.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
9_Default_ExtraTrees_Stacked not trained. Stop training after the first fold. Time needed to train on the first fold 62.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
* Step ensemble_stacked will try to check up to 1 model
'module' object is not callable
Ensemble_Stacked accuracy 0.990998 trained in 39.94 seconds
AutoML fit time: 14544.82 seconds
AutoML best model: Ensemble_Stacked
Traceback (most recent call last):
  File "/home/piotr/sandbox/tps-may/baseline.py", line 15, in <module>
    y_predicted = model.predict(test[x_cols])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/automl.py", line 451, in predict
    return self._predict(X)
           ^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/base_automl.py", line 1503, in _predict
    predictions = self._base_predict(X)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/base_automl.py", line 1465, in _base_predict
    predictions = model.predict(X, X_stacked)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/ensemble.py", line 434, in predict
    y_predicted_from_model = model.predict(X_stacked)
                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/model_framework.py", line 447, in predict
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/preprocessing/preprocessing.py", line 395, in transform
    X_validation = self._kmeans.transform(X_validation)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/supervised/preprocessing/kmeans_transformer.py", line 72, in transform
    X_scaled = self._scale.transform(X[self._input_columns])
                                     ~^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/piotr/sandbox/tps-may/venv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['Ensemble_prediction_0_for_e_1_for_p', '60_Xgboost_prediction_0_for_e_1_for_p', '63_CatBoost_prediction_0_for_e_1_for_p', '11_Xgboost_categorical_mix_prediction_0_for_e_1_for_p', '29_CatBoost_KMeansFeatures_prediction_0_for_e_1_for_p', '64_CatBoost_prediction_0_for_e_1_for_p', '58_CatBoost_prediction_0_for_e_1_for_p', '57_Xgboost_prediction_0_for_e_1_for_p', '59_CatBoost_prediction_0_for_e_1_for_p', '60_Xgboost_BoostOnErrors_prediction_0_for_e_1_for_p', '62_Xgboost_prediction_0_for_e_1_for_p', '29_CatBoost_prediction_0_for_e_1_for_p', '61_Xgboost_prediction_0_for_e_1_for_p', '11_Xgboost_prediction_0_for_e_1_for_p', '56_Xgboost_prediction_0_for_e_1_for_p', '29_CatBoost_GoldenFeatures_prediction_0_for_e_1_for_p', '10_Xgboost_prediction_0_for_e_1_for_p', '11_Xgboost_categorical_mix_KMeansFeatures_prediction_0_for_e_1_for_p', '28_CatBoost_prediction_0_for_e_1_for_p', '4_Default_LightGBM_prediction_0_for_e_1_for_p', '6_Default_CatBoost_prediction_0_for_e_1_for_p', '20_LightGBM_prediction_0_for_e_1_for_p', '19_LightGBM_prediction_0_for_e_1_for_p', '7_Default_NeuralNetwork_prediction_0_for_e_1_for_p', '55_NeuralNetwork_prediction_0_for_e_1_for_p', '38_RandomForest_prediction_0_for_e_1_for_p', '9_Default_ExtraTrees_prediction_0_for_e_1_for_p', '46_ExtraTrees_prediction_0_for_e_1_for_p', '8_Default_RandomForest_prediction_0_for_e_1_for_p', '2_DecisionTree_prediction_0_for_e_1_for_p', '3_DecisionTree_prediction_0_for_e_1_for_p', '37_RandomForest_prediction_0_for_e_1_for_p', '1_DecisionTree_prediction_0_for_e_1_for_p'] not in index"
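
The KeyError at the end of the traceback is raised by the column selection `X[self._input_columns]` in kmeans_transformer.py: the frame passed at predict time does not contain the listed `*_prediction_0_for_e_1_for_p` columns. Below is a minimal pandas sketch of the same failure mode, with a quick way to list which columns are missing. The names `feature_1`, `model_a_prediction`, and `fit_columns` are hypothetical stand-ins, and the snippet does not reproduce the internals of mljar-supervised.

```python
import pandas as pd

# Hypothetical stand-ins: "feature_1" plays the role of an original input
# column, "model_a_prediction" the role of a stacked prediction column.
fit_columns = ["feature_1", "model_a_prediction"]

# Frame seen at predict time: the stacked prediction column is absent.
X = pd.DataFrame({"feature_1": [0.1, 0.2, 0.3]})

# Columns that were expected but are not present in the frame.
missing = [c for c in fit_columns if c not in X.columns]
print("missing columns:", missing)

error_message = None
try:
    # Same kind of selection as X[self._input_columns] in the traceback.
    X[fit_columns]
except KeyError as err:
    error_message = str(err)
    print("KeyError:", error_message)
```

Comparing the expected column list against `X.columns` before the selection gives a shorter and more readable report than the raw KeyError when many columns are missing.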

pplonski commented 2 weeks ago

Related to #722

pplonski commented 2 weeks ago

Data set used: https://www.kaggle.com/competitions/playground-series-s4e8

Code:

import pandas as pd
from supervised import AutoML

# Load a 10% sample of the training data to speed up the run
train = pd.read_csv("playground-series-s4e8/train.csv").sample(frac=0.1, random_state=123)

print(train.head())
# Column 0 is skipped, column 1 is the target, the remaining columns are features
x_cols = train.columns[2:]
y_col = train.columns[1]
print(x_cols, y_col)

model = AutoML(eval_metric="accuracy", total_time_limit=3600, mode="Compete", kmeans_features=False)
model.fit(train[x_cols], train[y_col])

# Predict on the full test set
test = pd.read_csv("playground-series-s4e8/test.csv")
y_predicted = model.predict(test[x_cols])

# Write the predictions in the sample submission format
submission = pd.read_csv("playground-series-s4e8/sample_submission.csv")
submission["class"] = y_predicted
submission.to_csv("baseline_m_2.csv", index=False)