pycaret / pycaret

An open-source, low-code machine learning library in Python
https://www.pycaret.org
MIT License

[BUG] Error running autoML function #2163

Closed sgujjaABM closed 2 years ago

sgujjaABM commented 2 years ago

Describe the bug

To Reproduce

Hi, I am trying to run the PyCaret automl function, but I get an error with or without excluding 'huber' (see the error below). Can you please help with the code? Thank you.

Initiated . . . . . . . . . . . . . . . . . . 02:33:45
Status . . . . . . . . . . . . . . . . . . Preprocessing Data
Text(value="Following data types have been inferred automatically, if they are correct press enter to continue or type 'quit' otherwise.", layout=Layout(width='100%'))
     Data Type
0      Numeric
1      Numeric
2      Numeric
3      Numeric
4      Numeric
...        ...
1029   Numeric
1030   Numeric
1031   Numeric
1032   Numeric
y        Label
[1034 rows x 1 columns]
Setup Succesfully Completed!
Traceback (most recent call last):
  File "/home/ec2-user/sgujja/qsar_modeling/repos/fup/qsar_class_pycaret_ppb.py", line 185, in <pandas.io.formats.style.Styler object at 0x7f99755d4b50>
    top5 = compare_models(exclude = ['huber'],n_select = 5)
  File "/home/ec2-user/anaconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/classification.py", line 771, in compare_models
    return pycaret.internal.tabular.compare_models(
  File "/home/ec2-user/anaconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/internal/tabular.py", line 1910, in compare_models
    raise ValueError(
ValueError: Estimator Not Available huber. Please see docstring for list of available estimators.

regr = setup(data=train, target = 'y',session_id=123,fold_shuffle=True,numeric_features = features,remove_multicollinearity = True, multicollinearity_threshold = 0.95)
#
# compare all baseline models and select top 5
top5 = compare_models(exclude = ['huber'],n_select = 5)
# tune top 5 base models
tuned_top5 = [tune_model(i,n_iter=50) for i in top5]
# ensemble top 5 tuned models
bagged_top5 = [ensemble_model(i,n_estimators=100) for i in tuned_top5]
# blend top 5 base models
blender = blend_models(estimator_list = bagged_top5)
# select best model
best = automl(optimize = 'R2')
train_metrics = pull()
print("Train metrics:")
print(train_metrics)

# Deploy Model to generate predictions on hold out data
predict_model(best)
# pull
test_metrics = pull()
print("Test metrics:")
print(test_metrics)

# Predicting on unseen data
test_predictions = predict_model(best, test)
print(test_predictions.head(5))

# save
test_predictions.to_csv(path_or_buf=out+"/"+basename+"_pycaret_automl_regr_test_results.csv",index=False,quoting=3,sep=';')

Expected behavior

Additional context

Versions

dimension23 commented 2 years ago

@sgujjaABM I don't think it's related to AutoML

It looks like your problem was inferred as classification instead of regression. The classification module has no huber estimator, so excluding it fails, as the ValueError message indicates.

ValueError: Estimator Not Available huber. Please see docstring for list of available estimators.

Check the following line of your traceback: it points to classification.py. It's likely that your target variable y is categorical. Please check.

File "/home/ec2-user/anaconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/classification.py", line 771, in compare_models
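The check described above (reading the traceback to see which PyCaret module handled the call) can be sketched with the standard library alone. This is only an illustration: the helper name `pycaret_task_from_traceback` is hypothetical, and the pattern assumes POSIX-style paths like the ones in this thread.

```python
import re

# Matches frames such as:
#   File ".../site-packages/pycaret/classification.py", line 771, in compare_models
# Assumption: forward-slash paths, as in the tracebacks posted in this issue.
_FRAME = re.compile(
    r'File ".*?/pycaret/(classification|regression)\.py", line \d+, in \w+'
)


def pycaret_task_from_traceback(traceback_text):
    """Return the PyCaret task modules seen in a traceback, in order, without repeats."""
    seen = []
    for match in _FRAME.finditer(traceback_text):
        task = match.group(1)
        if task not in seen:
            seen.append(task)
    return seen


tb = '''File "/home/ec2-user/anaconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/classification.py", line 771, in compare_models
    return pycaret.internal.tabular.compare_models('''

print(pycaret_task_from_traceback(tb))
```

If the result names a module other than the one you meant to use, the import at the top of the script is the first thing to check.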

You can run the following lines of code to get the full list of supported regression models:

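# Import the submodule explicitly; a bare "import pycaret" may not load it
import pycaret.containers.models.regression

# Minimal globals the model containers expect
globals_dict = {}
globals_dict["seed"] = 42
globals_dict["gpu_param"] = 0
globals_dict["n_jobs_param"] = -1

# Collect every non-special regression model container, keyed by model ID
_all_models = {
    k: v
    for k, v in pycaret.containers.models.regression.get_all_model_containers(
        globals_dict, raise_errors=True
    ).items()
    if not v.is_special
}

print(_all_models)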

I hope this helps!

sgujjaABM commented 2 years ago

Thank you for the reply. I am running this as a classification problem, so initially I ran it without excluding 'huber' and got the error below. The target variable is categorical. Can you please let me know if I am missing something here?

Initiated  . . . . . . . . . . . . . . . . . .                09:13:05
Status     . . . . . . . . . . . . . . . . . .  Compiling Final Models
Estimator  . . . . . . . . . . . . . . . . . .         Huber Regressor
Traceback (most recent call last):
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-249556cb224b>", line 1, in <module>
    top5 = compare_models(n_select = 5)
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/regression.py", line 763, in compare_models
    return pycaret.internal.tabular.compare_models(
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/internal/tabular.py", line 2283, in compare_models
    model, model_fit_time = create_model_supervised(
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/internal/tabular.py", line 3026, in create_model_supervised
    pipeline_with_model.fit(data_X, data_y, **fit_kwargs)
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/pycaret/internal/pipeline.py", line 118, in fit
    result = super().fit(X, y=y, **fit_kwargs)
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/imblearn/pipeline.py", line 281, in fit
    self._final_estimator.fit(Xt, yt, **fit_params)
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/sklearn/linear_model/_huber.py", line 296, in fit
    self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
  File "/Users/sgujja/miniconda3/envs/my-rdkit-env/lib/python3.9/site-packages/sklearn/utils/optimize.py", line 243, in _check_optimize_result
    ).format(solver, result.status, result.message.decode("latin1"))
AttributeError: 'str' object has no attribute 'decode'

moezali1 commented 2 years ago

@sgujjaABM If you have imported the classification module, you don't need to pass exclude to compare_models. huber doesn't exist in the classification module, which is why you get this exception.
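One way to avoid this exception is to filter the exclude list against the IDs the imported module actually offers before calling compare_models. The sketch below is plain Python: `filter_exclude` is a hypothetical helper, and `classification_ids` is an illustrative subset typed in by hand. In a live session the IDs would more likely come from the module's own model listing (for example the index of the table returned by `models()`, if your PyCaret version provides it).

```python
def filter_exclude(requested, available_ids):
    """Split requested model IDs into those the module knows and those it does not."""
    available = set(available_ids)
    valid = [m for m in requested if m in available]
    unknown = [m for m in requested if m not in available]
    return valid, unknown


# Assumption: an illustrative, incomplete subset of classification model IDs.
classification_ids = ["lr", "knn", "nb", "dt", "svm", "rf", "lightgbm"]

valid, unknown = filter_exclude(["huber", "knn"], classification_ids)
print("safe to exclude:", valid)
print("not in this module:", unknown)
```

Only the `valid` list would then be passed as `exclude`; anything in `unknown` signals a mismatch between the estimator name and the imported task module.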

sgujjaABM commented 2 years ago

Thank you for the reply. I updated the code and it seems to be running now. However, it is not using all the cores, so it is running very slowly. Can you please suggest how to speed up processing? Thank you.


##Setting up the environment in PyCaret
classf = setup(data=train, target = 'y',session_id=123,fold_shuffle=True,numeric_features = features) #,remove_multicollinearity = True, multicollinearity_threshold = 0.95)
#
# compare all baseline models and select top 5
#best = compare_models()
top5 = compare_models(n_select = 5)

# tune models
tuned_top5 = [tune_model(i) for i in top5]

# ensemble models
bagged_top5 = [ensemble_model(i) for i in tuned_top5]

# blend models
blender = blend_models(estimator_list = top5)

# stack models
stacker = stack_models(estimator_list = top5)

# automl 
best = automl()
print(best)

#analyze best model
#evaluate_model(best)

train_metrics = pull()
print("Train metrics:")
print(train_metrics)
train_metrics.to_csv(path_or_buf=out+"/"+basename+"_train_metrics.csv",index=False,quoting=3,sep=';')

# Deploy Model to generate predictions on hold out data
predict_model(best)
# pull
test_metrics = pull()
print("Test metrics:")
print(test_metrics)

moezali1 commented 2 years ago

@sgujjaABM Glad this helped.

For the multi-core question, please open a new issue with logs and a more detailed explanation of what you are expecting.
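When opening such an issue, a short environment report can be attached alongside the logs. The following is a minimal standard-library sketch, not an official PyCaret diagnostic: the function name `environment_report` and the default package list are assumptions chosen for illustration.

```python
import os
import platform
from importlib import metadata


def environment_report(packages=("pycaret", "scikit-learn", "scipy", "joblib")):
    """Collect the Python version, CPU counts, and installed package versions."""
    report = {
        "python": platform.python_version(),
        "logical_cpus": os.cpu_count(),
    }
    # sched_getaffinity exists only on some platforms (e.g. Linux); it reports
    # the CPUs this process is actually allowed to run on.
    if hasattr(os, "sched_getaffinity"):
        report["usable_cpus"] = len(os.sched_getaffinity(0))
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If `usable_cpus` is lower than `logical_cpus`, the process is restricted at the operating-system level, which would be worth mentioning in the new issue.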