pycaret / pycaret

An open-source, low-code machine learning library in Python
https://www.pycaret.org
MIT License

[BUG]: ClassificationExperiment().optimize_threshold() raises error 'CustomProbabilityThresholdClassifier' has no len() #3915

Open christopherkottke opened 6 months ago

christopherkottke commented 6 months ago

pycaret version checks

Issue Description

When attempting to run optimize_threshold on a model, I get the error that 'CustomProbabilityThresholdClassifier' has no len(). I have reproduced this error on the latest version of pycaret and on the master branch. In both cases, the virtual environment was built with only pip install pycaret or pip install -U git+https://github.com/pycaret/pycaret.git@master, with no other packages installed.

It appears to stem from the optimization function on line 2687 of pycaret/classification/oop.py: model_results["model"] = model[0]. Somewhere down the line inside pandas, there is a check whether model[0] has the same length as the dataframe's index, but since model[0] is of type CustomProbabilityThresholdClassifier and does not support len(), it causes an error.

Thank you for your assistance.

Reproducible Example

from pycaret.classification import ClassificationExperiment
from pycaret.datasets import get_data

data = get_data('diabetes')
ex = ClassificationExperiment()
ex.setup(data=data, target='Class variable', session_id=123)
model = ex._create_model('rf')

threshold = ex.optimize_threshold(model)

Expected Behavior

I would expect that this code would optimize the classification threshold with respect to accuracy and return the optimized value in threshold.

Actual Results

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "$USER/$VENV/lib/python3.10/site-packages/pycaret/classification/oop.py", line 2703, in optimize_threshold
    result = shgo(objective, ((0, 1),), **shgo_kwargs)
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo.py", line 454, in shgo
    shc.iterate_all()
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo.py", line 832, in iterate_all
    self.iterate()
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo.py", line 1004, in iterate
    self.iterate_complex()
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo.py", line 1108, in iterate_delaunay
    self.HC.V.process_pools()
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo_lib/_vertex.py", line 327, in process_pools
    self.process_fpool()
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo_lib/_vertex.py", line 384, in proc_fpool_nog
    self.compute_sfield(v)
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/optimize/_shgo_lib/_vertex.py", line 347, in compute_sfield
    v.f = self.field(v.x_a, *self.field_args)
  File "$USER/$VENV/lib/python3.10/site-packages/scipy/_lib/_util.py", line 360, in __call__
    return self.f(x, *self.args)
  File "$USER/$VENV/lib/python3.10/site-packages/pycaret/classification/oop.py", line 2687, in objective
    model_results["model"] = model[0]
  File "$USER/$VENV/lib/python3.10/site-packages/pandas/core/frame.py", line 4091, in __setitem__
    self._set_item(key, value)
  File "$USER/$VENV/lib/python3.10/site-packages/pandas/core/frame.py", line 4300, in _set_item
    value, refs = self._sanitize_column(value)
  File "$USER/$VENV/lib/python3.10/site-packages/pandas/core/frame.py", line 5039, in _sanitize_column
    com.require_length_match(value, self.index)
  File "$USER/$VENV/lib/python3.10/site-packages/pandas/core/common.py", line 560, in require_length_match
    if len(data) != len(index):
TypeError: object of type 'CustomProbabilityThresholdClassifier' has no len()

Installed Versions

System:
  python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
  executable: $USER/$VENV/bin/python
  machine: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

PyCaret required dependencies:
  pip: 23.3.1
  setuptools: 68.2.2
  pycaret: 3.3.0
  IPython: 8.22.0
  ipywidgets: 8.1.2
  tqdm: 4.66.2
  numpy: 1.26.4
  pandas: 2.1.4
  jinja2: 3.1.3
  scipy: 1.11.4
  joblib: 1.3.2
  sklearn: 1.4.1.post1
  pyod: 1.1.3
  imblearn: 0.12.0
  category_encoders: 2.6.3
  lightgbm: 4.3.0
  numba: 0.59.0
  requests: 2.31.0
  matplotlib: 3.7.5
  scikitplot: 0.3.7
  yellowbrick: 1.5
  plotly: 5.19.0
  plotly-resampler: Not installed
  kaleido: 0.2.1
  schemdraw: 0.15
  statsmodels: 0.14.1
  sktime: 0.26.0
  tbats: 1.1.3
  pmdarima: 2.0.4
  psutil: 5.9.0
  markupsafe: 2.1.5
  pickle5: Not installed
  cloudpickle: 3.0.0
  deprecation: 2.1.0
  xxhash: 3.4.1
  wurlitzer: 3.0.3

PyCaret optional dependencies:
  shap: Not installed
  interpret: Not installed
  umap: Not installed
  ydata_profiling: Not installed
  explainerdashboard: Not installed
  autoviz: Not installed
  fairlearn: Not installed
  deepchecks: Not installed
  xgboost: Not installed
  catboost: Not installed
  kmodes: Not installed
  mlxtend: Not installed
  statsforecast: Not installed
  tune_sklearn: Not installed
  ray: Not installed
  hyperopt: Not installed
  optuna: Not installed
  skopt: Not installed
  mlflow: Not installed
  gradio: Not installed
  fastapi: Not installed
  uvicorn: Not installed
  m2cgen: Not installed
  evidently: Not installed
  fugue: Not installed
  streamlit: Not installed
  prophet: Not installed

celestinoxp commented 6 months ago

Check if your project directory has a logs.txt file

christopherkottke commented 6 months ago

Yes, however there is nothing pertaining to the error in the logs. This is the contents of the logs.txt file resulting just from ex.optimize_threshold(model) which resulted in the error:

2024-02-26 09:36:16,439:INFO:Initializing optimize_threshold()
2024-02-26 09:36:16,440:INFO:optimize_threshold(return_data=False, plot_kwargs=None, shgo_kwargs={}, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, monotonic_cst=None, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False), optimize=Accuracy, self=<pycaret.classification.oop.ClassificationExperiment object at 0x7f235e7711b0>, verbose=True)
2024-02-26 09:36:16,440:INFO:Importing libraries
2024-02-26 09:36:16,441:INFO:Checking exceptions
2024-02-26 09:36:16,441:INFO:defining variables
2024-02-26 09:36:16,442:INFO:starting optimization
2024-02-26 09:36:16,444:INFO:Initializing create_model()
2024-02-26 09:36:16,444:INFO:create_model(self=<pycaret.classification.oop.ClassificationExperiment object at 0x7f235e7711b0>, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, monotonic_cst=None, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False), fold=None, round=4, cross_validation=True, predict=True, fit_kwargs=None, groups=None, refit=True, probability_threshold=0.375, experiment_custom_tags=None, verbose=False, system=False, add_to_model_list=True, metrics=None, display=None, model_only=True, return_train_score=False, error_score=0.0, kwargs={})
2024-02-26 09:36:16,445:INFO:Checking exceptions
2024-02-26 09:36:16,446:INFO:Importing libraries
2024-02-26 09:36:16,447:INFO:Copying training dataset
2024-02-26 09:36:16,449:INFO:Defining folds
2024-02-26 09:36:16,450:INFO:Declaring metric variables
2024-02-26 09:36:16,450:INFO:Importing untrained model
2024-02-26 09:36:16,451:INFO:Declaring custom model
2024-02-26 09:36:16,451:INFO:Random Forest Classifier Imported successfully
2024-02-26 09:36:16,452:INFO:Starting cross validation
2024-02-26 09:36:16,453:INFO:Cross validating with StratifiedKFold(n_splits=10, random_state=None, shuffle=False), n_jobs=-1
2024-02-26 09:36:16,794:INFO:Calculating mean and std
2024-02-26 09:36:16,795:INFO:Creating metrics dataframe
2024-02-26 09:36:16,797:INFO:Finalizing model
2024-02-26 09:36:16,955:INFO:Uploading results into container
2024-02-26 09:36:16,956:INFO:Uploading model into container now
2024-02-26 09:36:16,956:INFO:_master_model_container: 3
2024-02-26 09:36:16,957:INFO:_display_container: 3
2024-02-26 09:36:16,958:INFO:CustomProbabilityThresholdClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, classifier=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, mo... random_state=123, verbose=0, warm_start=False), criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, monotonic_cst=None, n_estimators=100, n_jobs=-1, oob_score=False, probability_threshold=0.375, random_state=123, verbose=0, warm_start=False)
2024-02-26 09:36:16,958:INFO:create_model() successfully completed......................................

CyberGiant7 commented 5 months ago

I have the exact same problem, and I looked into the cause of this bug. When the line model_results["model"] = model[0] is executed, pandas creates the model column and inserts it into the model_results dataframe. Before inserting, pandas runs some checks on the type of the value being assigned. In this specific case the value is detected as list-like in this section:

if is_list_like(value):
    com.require_length_match(value, self.index)

is_list_like returns True because the model has the attribute __iter__, so execution enters the if branch and calls require_length_match. The problem is that the model does not support len() (it has no usable __len__), so require_length_match fails.
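The mechanism described above can be reproduced without PyCaret. The sketch below is only an illustration: ForwardingWrapper and InnerEnsemble are hypothetical stand-ins, and the idea that CustomProbabilityThresholdClassifier exposes __iter__ by forwarding attribute lookups to the wrapped estimator is an assumption, not something confirmed in this thread. It shows an object that pandas treats as list-like but that does not support len(), and that wrapping the object in a one-element list avoids the length check for a one-row dataframe.

```python
import pandas as pd


class InnerEnsemble:
    """Stand-in for an ensemble estimator that is iterable and sized."""

    def __init__(self):
        self.members = ["tree_0", "tree_1"]

    def __iter__(self):
        return iter(self.members)

    def __len__(self):
        return len(self.members)


class ForwardingWrapper:
    """Hypothetical wrapper that forwards unknown attributes to `inner`."""

    def __init__(self, inner):
        self.inner = inner

    def __getattr__(self, name):
        # Called only when normal lookup fails. Guard against recursion
        # if `inner` itself has not been set yet.
        if name == "inner":
            raise AttributeError(name)
        # getattr(wrapper, "__iter__") succeeds through this path, but
        # len(wrapper) still fails: len() looks up __len__ on the type.
        return getattr(self.inner, name)


results = pd.DataFrame({"Accuracy": [0.75]})
wrapper = ForwardingWrapper(InnerEnsemble())

# pandas considers the wrapper list-like because it exposes __iter__.
looks_list_like = pd.api.types.is_list_like(wrapper)

try:
    results["model"] = wrapper  # mirrors model_results["model"] = model[0]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Possible fix (an assumption, valid for a one-row dataframe): assign a
# one-element list so the length check passes and the object is stored as-is.
results["model"] = [wrapper]
stored = results["model"].iloc[0]
```

If this matches what happens inside PyCaret, it would also be consistent with the failure appearing only for estimators that define __iter__, but that link is not verified here.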

julianspaeth commented 4 months ago

Is there a workaround to fix this bug?

christopherkottke commented 4 months ago

Is there a workaround to fix this bug?

Not within pycaret as far as I've found. The workaround I've used is to run the threshold optimization myself and create a wrapper class that applies the optimized threshold on calls to predict_model. To optimize the threshold, you can run a shgo optimization as pycaret does, but since the problem is only one-dimensional, it is also simple enough to create an evenly spaced array of candidate thresholds and check which one optimizes your metric of interest. It would be nice, however, if this were resolved within pycaret. Especially when optimizing for a 0/1 loss like accuracy, where thresholding can dramatically change the model's performance, it should be done behind the scenes during the compare_models and tune_model steps.
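A minimal sketch of the workaround described above, assuming a binary classifier whose predict_proba returns the positive-class probability in column 1. The names best_threshold, accuracy, and ThresholdedClassifier are hypothetical and are not part of pycaret; the toy arrays are only there to make the example runnable.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def best_threshold(y_true, positive_proba, metric=accuracy, n_points=101):
    """Scan evenly spaced thresholds in [0, 1]; return (threshold, score).

    Ties are resolved in favour of the lowest threshold.
    """
    positive_proba = np.asarray(positive_proba)
    thresholds = np.linspace(0.0, 1.0, n_points)
    scores = [
        metric(y_true, (positive_proba >= t).astype(int)) for t in thresholds
    ]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])


class ThresholdedClassifier:
    """Hypothetical wrapper applying a fixed threshold to predict_proba."""

    def __init__(self, classifier, threshold):
        self.classifier = classifier
        self.threshold = threshold

    def predict(self, X):
        # Assumes column 1 holds the positive-class probability.
        proba = np.asarray(self.classifier.predict_proba(X))[:, 1]
        return (proba >= self.threshold).astype(int)


# Toy labels and positive-class probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
positive_proba = np.array([0.10, 0.35, 0.65, 0.90])
threshold, score = best_threshold(y_true, positive_proba)
```

In practice the probabilities should come from held-out or cross-validated predictions rather than the training data, otherwise the chosen threshold is likely to be overfit.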

celestinoxp commented 4 months ago

@christopherkottke can you make a pull-request?

tomsz92161 commented 4 months ago

It appears that the error occurs only with certain models, specifically 'gbc', 'ada', 'et', 'catboost', and 'rf'.

Lawlantosin commented 3 months ago

Has the error been resolved? It seems that the optimized threshold is not working with some tree models.