mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Unsure How to Load Stacked/GoldenFeature Models #607

Open woochan-jang opened 1 year ago

woochan-jang commented 1 year ago

Hello,

thanks for putting out this package. It seems very usable/promising.

I'm using mljar-supervised version 0.11.5 from conda-forge. One small question: after giving it a shot, I wanted to load one of the GoldenFeatures stacked models.

import json

from supervised.algorithms.lightgbm import LightgbmAlgorithm

with open(f'../logs/{MLJAR_date}/2_Optuna_LightGBM_GoldenFeatures_Stacked/framework.json', 'r') as f:
    framework = json.load(f)
lgbm_gf_st = LightgbmAlgorithm(framework)
lgbm_gf_st.load(f'../logs/{MLJAR_date}/2_Optuna_LightGBM_GoldenFeatures_Stacked/learner_fold_0.lightgbm')

When I ran model.predict() on the above model, however, it asked for the non-original features, i.e. the meta-features and the golden features. Do I need to build the pipeline myself, or is there a built-in API for loading these types of models as a whole?
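(For context on why a stacked model asks for extra columns: in stacking, the base models' out-of-fold predictions are appended to the original features as meta-features, and the stacked model is trained on that wider matrix. A minimal conceptual sketch of that idea, not mljar-supervised's actual pipeline code:

import numpy as np

def make_stacked_features(X, base_predictions):
    """Append each base model's prediction column to the original features."""
    meta = np.column_stack(base_predictions)  # shape: (n_samples, n_base_models)
    return np.hstack([X, meta])               # original features + meta-features

X = np.ones((4, 3))                           # 4 rows, 3 original features
preds = [np.zeros(4), np.full(4, 0.5)]        # outputs of two hypothetical base models
X_stacked = make_stacked_features(X, preds)
print(X_stacked.shape)                        # (4, 5)

So a model saved with the _Stacked suffix expects that widened matrix, not the raw columns.)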

I tried to select the model from the automl object below, but couldn't figure it out.

automl = AutoML(results_path=f"../logs/{MLJAR_date}/")
automl.load(f"../logs/{MLJAR_date}/")

Also, this automl object seems to have no attributes set before I run the load function, so I can't tell which model it uses for predictions. (And when I inspect automl._models, it is missing quite a few of the models I trained.)

{'1_Baseline': <supervised.model_framework.ModelFramework at 0x7f0b0386caf0>,
 '2_Optuna_LightGBM': <supervised.model_framework.ModelFramework at 0x7f0ce3f1d280>,
 '3_Optuna_Xgboost': <supervised.model_framework.ModelFramework at 0x7f0c316ddac0>,
 '4_Optuna_CatBoost': <supervised.model_framework.ModelFramework at 0x7f0cec9ce6d0>,
 '5_Optuna_NeuralNetwork': <supervised.model_framework.ModelFramework at 0x7f0af81a7be0>,
 '6_Optuna_RandomForest': <supervised.model_framework.ModelFramework at 0x7f0ac01f8d30>,
 '7_Optuna_ExtraTrees': <supervised.model_framework.ModelFramework at 0x7f0a64097d30>,
 '2_Optuna_LightGBM_GoldenFeatures': <supervised.model_framework.ModelFramework at 0x7f0a0ca87160>,
 '6_Optuna_RandomForest_GoldenFeatures': <supervised.model_framework.ModelFramework at 0x7f0ca830adf0>,
 '3_Optuna_Xgboost_GoldenFeatures': <supervised.model_framework.ModelFramework at 0x7f0a04a2e790>,
 'Ensemble': <supervised.ensemble.Ensemble at 0x7f0a0bac0f40>,
 '2_Optuna_LightGBM_Stacked': <supervised.model_framework.ModelFramework at 0x7f09fe2e7430>,
 '6_Optuna_RandomForest_Stacked': <supervised.model_framework.ModelFramework at 0x7f09f584fca0>,
 '3_Optuna_Xgboost_Stacked': <supervised.model_framework.ModelFramework at 0x7f09f42179d0>,
 '4_Optuna_CatBoost_Stacked': <supervised.model_framework.ModelFramework at 0x7f035c3dd160>,
 '7_Optuna_ExtraTrees_Stacked': <supervised.model_framework.ModelFramework at 0x7eff409c5dc0>,
 '5_Optuna_NeuralNetwork_Stacked': <supervised.model_framework.ModelFramework at 0x7eff30105940>,
 '2_Optuna_LightGBM_GoldenFeatures_Stacked': <supervised.model_framework.ModelFramework at 0x7efed42cc520>,
 '6_Optuna_RandomForest_GoldenFeatures_Stacked': <supervised.model_framework.ModelFramework at 0x7efe0c1c3a60>,
 '3_Optuna_Xgboost_GoldenFeatures_Stacked': <supervised.model_framework.ModelFramework at 0x7efdf4424370>,
 'Ensemble_Stacked': <supervised.ensemble.Ensemble at 0x7f09fe2b0550>}

in training versus

{'1_Baseline': <supervised.model_framework.ModelFramework at 0x7f2200afe700>,
 '2_Optuna_LightGBM': <supervised.model_framework.ModelFramework at 0x7f2200afea30>,
 '2_Optuna_LightGBM_GoldenFeatures': <supervised.model_framework.ModelFramework at 0x7f2200afe3a0>,
 '3_Optuna_Xgboost': <supervised.model_framework.ModelFramework at 0x7f21f410b220>,
 '3_Optuna_Xgboost_GoldenFeatures': <supervised.model_framework.ModelFramework at 0x7f2200b70160>,
 '4_Optuna_CatBoost': <supervised.model_framework.ModelFramework at 0x7f2200b70250>,
 '5_Optuna_NeuralNetwork': <supervised.model_framework.ModelFramework at 0x7f167975ac40>,
 '6_Optuna_RandomForest': <supervised.model_framework.ModelFramework at 0x7f21ff348700>,
 '6_Optuna_RandomForest_GoldenFeatures': <supervised.model_framework.ModelFramework at 0x7f1679616970>,
 '6_Optuna_RandomForest_GoldenFeatures_Stacked': <supervised.model_framework.ModelFramework at 0x7f22007068b0>,
 '7_Optuna_ExtraTrees': <supervised.model_framework.ModelFramework at 0x7f1ea43337c0>,
 'Ensemble': <supervised.ensemble.Ensemble at 0x7f16797301f0>,
 'Ensemble_Stacked': <supervised.ensemble.Ensemble at 0x7f16797302e0>}

after loading.

thank you!

woochan-jang commented 1 year ago

I've just tried to run one of the stacked models from the automl object, and it throws this error:

code:

models_map["Ensemble_Stacked"].predict(preprocessed_X_test[model.feature_name_])

error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[310], line 1
----> 1 preprocessed_X_test["surrogate_logit"] = models_map["Ensemble_Stacked"].predict(preprocessed_X_test[model.feature_name_])
      2 preprocessed_X_test["surrogate_score"] = np.exp(preprocessed_X_test["surrogate_logit"])/(1+np.exp(preprocessed_X_test["surrogate_logit"]))

File ~/anaconda3/envs/mljar/lib/python3.8/site-packages/supervised/ensemble.py:306, in Ensemble.predict(self, X, X_stacked)
    303 total_repeat += repeat
    305 if model._is_stacked:
--> 306     y_predicted_from_model = model.predict(X_stacked)
    307 else:
    308     y_predicted_from_model = model.predict(X)

File ~/anaconda3/envs/mljar/lib/python3.8/site-packages/supervised/model_framework.py:428, in ModelFramework.predict(self, X)
    425 y_predicted = None  # np.zeros((X.shape[0],))
    426 for ind, learner in enumerate(self.learners):
    427     # preprocessing goes here
--> 428     X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)
    429     y_p = learner.predict(X_data)
    430     y_p = self.preprocessings[ind].inverse_scale_target(y_p)

AttributeError: 'NoneType' object has no attribute 'copy'

The data is all floats, no NaNs or infs, with the same columns as the data originally passed to AutoML().fit.
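(Side note on the cell in the traceback: the logit-to-probability step np.exp(x)/(1+np.exp(x)) overflows for large positive logits. A numerically stable sigmoid sketch, independent of mljar-supervised:

import numpy as np

def sigmoid(logits):
    """Numerically stable logistic sigmoid, equivalent to exp(x)/(1+exp(x))."""
    out = np.empty_like(logits, dtype=float)
    pos = logits >= 0
    # For x >= 0, use 1/(1+exp(-x)) so exp never sees a large positive argument
    out[pos] = 1.0 / (1.0 + np.exp(-logits[pos]))
    # For x < 0, exp(x)/(1+exp(x)) keeps exp's argument negative
    e = np.exp(logits[~pos])
    out[~pos] = e / (1.0 + e)
    return out

scores = sigmoid(np.array([0.0, 1000.0, -1000.0]))  # no overflow warnings

This avoids the RuntimeWarning the naive formula raises on extreme logits.)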

pplonski commented 1 year ago

Hi @woochan-jang,

To load AutoML models, just create an AutoML object with the same results path:

from supervised.automl import AutoML

automl = AutoML(results_path=f"../logs/{MLJAR_date}/")

AutoML will load all needed models and use the best model for predictions. Just call:

predictions = automl.predict(new_data)

Computing predictions from a selected model (other than the best model) will be implemented in https://github.com/mljar/mljar-supervised/issues/423