Open christopherkottke opened 6 months ago
Check if your project directory has a logs.txt file
Yes, but nothing in the logs pertains to the error. These are the contents of logs.txt after running only ex.optimize_threshold(model), which resulted in the error:
2024-02-26 09:36:16,439:INFO:Initializing optimize_threshold()
2024-02-26 09:36:16,440:INFO:optimize_threshold(return_data=False, plot_kwargs=None, shgo_kwargs={}, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, monotonic_cst=None, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False), optimize=Accuracy, self=<pycaret.classification.oop.ClassificationExperiment object at 0x7f235e7711b0>, verbose=True)
2024-02-26 09:36:16,440:INFO:Importing libraries
2024-02-26 09:36:16,441:INFO:Checking exceptions
2024-02-26 09:36:16,441:INFO:defining variables
2024-02-26 09:36:16,442:INFO:starting optimization
2024-02-26 09:36:16,444:INFO:Initializing create_model()
2024-02-26 09:36:16,444:INFO:create_model(self=<pycaret.classification.oop.ClassificationExperiment object at 0x7f235e7711b0>, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, monotonic_cst=None, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False), fold=None, round=4, cross_validation=True, predict=True, fit_kwargs=None, groups=None, refit=True, probability_threshold=0.375, experiment_custom_tags=None, verbose=False, system=False, add_to_model_list=True, metrics=None, display=None, model_only=True, return_train_score=False, error_score=0.0, kwargs={})
2024-02-26 09:36:16,445:INFO:Checking exceptions
2024-02-26 09:36:16,446:INFO:Importing libraries
2024-02-26 09:36:16,447:INFO:Copying training dataset
2024-02-26 09:36:16,449:INFO:Defining folds
2024-02-26 09:36:16,450:INFO:Declaring metric variables
2024-02-26 09:36:16,450:INFO:Importing untrained model
2024-02-26 09:36:16,451:INFO:Declaring custom model
2024-02-26 09:36:16,451:INFO:Random Forest Classifier Imported successfully
2024-02-26 09:36:16,452:INFO:Starting cross validation
2024-02-26 09:36:16,453:INFO:Cross validating with StratifiedKFold(n_splits=10, random_state=None, shuffle=False), n_jobs=-1
2024-02-26 09:36:16,794:INFO:Calculating mean and std
2024-02-26 09:36:16,795:INFO:Creating metrics dataframe
2024-02-26 09:36:16,797:INFO:Finalizing model
2024-02-26 09:36:16,955:INFO:Uploading results into container
2024-02-26 09:36:16,956:INFO:Uploading model into container now
2024-02-26 09:36:16,956:INFO:_master_model_container: 3
2024-02-26 09:36:16,957:INFO:_display_container: 3
2024-02-26 09:36:16,958:INFO:CustomProbabilityThresholdClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, classifier=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, mo... random_state=123, verbose=0, warm_start=False), criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, monotonic_cst=None, n_estimators=100, n_jobs=-1, oob_score=False, probability_threshold=0.375, random_state=123, verbose=0, warm_start=False)
2024-02-26 09:36:16,958:INFO:create_model() successfully completed......................................
I have the exact same problem. I looked into the cause of this bug a little. It is triggered when this line is executed:
model_results["model"] = model[0]
This creates the model column and inserts it into the model_results dataframe. On insertion, however, pandas runs some checks on the type of the object being inserted.
In this case the value is detected as list-like in this section:
if is_list_like(value):
    com.require_length_match(value, self.index)
is_list_like returns True because the model has the attribute __iter__, so execution enters the if branch and calls require_length_match. The problem is that the model does not implement __len__, so require_length_match fails.
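The mechanism can be reproduced without pandas or pycaret. The following is only a minimal sketch: IterableWithoutLen, looks_list_like, check_length and assign_column are hypothetical stand-ins written for illustration, not the actual pandas or pycaret code, and they simplify the real checks considerably.

```python
class IterableWithoutLen:
    """Hypothetical stand-in for the wrapped model: has __iter__ but no __len__."""

    def __iter__(self):
        return iter(())


def looks_list_like(value) -> bool:
    # Simplified stand-in for pandas' is_list_like: iterable, but not a string.
    return hasattr(value, "__iter__") and not isinstance(value, (str, bytes))


def check_length(value, index) -> None:
    # Simplified stand-in for the length check: it calls len() on the value.
    if len(value) != len(index):
        raise ValueError("Length of values does not match length of index")


def assign_column(value, index):
    """Mimic the branch quoted above for a single column assignment."""
    if looks_list_like(value):
        check_length(value, index)  # raises TypeError if value has no __len__
        return list(value)
    return [value] * len(index)  # a scalar is broadcast to every row


index = [0, 1, 2]
model = IterableWithoutLen()

try:
    assign_column(model, index)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

An object that is iterable but unsized passes the list-like test and then fails inside the length check, which matches the shape of the reported error.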
Is there a workaround to fix this bug?
Not within pycaret, as far as I've found. The workaround I've used is to run the threshold optimization myself and create a wrapper class that applies the optimized threshold on calls to predict_model. To optimize the threshold you can run a shgo optimization, as pycaret does, but since the problem is one-dimensional it is also simple enough to create a linspaced array of candidate thresholds and check which one optimizes your metric of interest.
It would be nice, however, if this were resolved within pycaret. Especially when optimizing a 0/1 loss such as accuracy, where thresholding can dramatically change the model's performance, it should be done behind the scenes during the compare_models and tune_model steps.
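A rough sketch of the manual grid search and wrapper described above, written in plain Python so it has no dependencies. The names accuracy, best_threshold and ThresholdedClassifier are hypothetical, the wrapped model is assumed to expose a scikit-learn style predict_proba whose column 1 is the positive class, and whether such a wrapper works with pycaret's predict_model is not verified here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_threshold(y_true, proba_pos, metric=accuracy, n_grid=101):
    """Grid-search a 1D decision threshold in [0, 1].

    Ties are resolved in favour of the lowest threshold.
    """
    best_t, best_score = 0.0, float("-inf")
    for i in range(n_grid):
        t = i / (n_grid - 1)  # evenly spaced, like numpy.linspace(0, 1, n_grid)
        y_pred = [int(p >= t) for p in proba_pos]
        score = metric(y_true, y_pred)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


class ThresholdedClassifier:
    """Hypothetical wrapper: applies a fixed threshold to predict_proba output."""

    def __init__(self, classifier, threshold):
        self.classifier = classifier
        self.threshold = threshold

    def predict_proba(self, X):
        return self.classifier.predict_proba(X)

    def predict(self, X):
        # Assumption: column 1 holds the positive-class probability.
        return [int(row[1] >= self.threshold) for row in self.predict_proba(X)]


class FakeModel:
    """Toy model for the usage example: each input is its own probability."""

    def predict_proba(self, X):
        return [[1.0 - p, p] for p in X]


y_true = [0, 0, 1, 1]
proba_pos = [0.1, 0.3, 0.35, 0.8]

threshold, score = best_threshold(y_true, proba_pos)
wrapped = ThresholdedClassifier(FakeModel(), threshold)
predictions = wrapped.predict(proba_pos)
```

In practice the probabilities should come from held-out or cross-validated predictions rather than the training data, otherwise the chosen threshold is likely to be overfit.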
@christopherkottke can you make a pull-request?
It appears that the error occurs only with certain models, specifically 'gbc', 'ada', 'et', 'catboost', and 'rf'
Has the error been resolved? It seems that the optimized threshold is not working with some tree models.
pycaret version checks
[X] I have checked that this issue has not already been reported here.
[X] I have confirmed this bug exists on the latest version of pycaret.
[X] I have confirmed this bug exists on the master branch of pycaret (pip install -U git+https://github.com/pycaret/pycaret.git@master).
Issue Description
When attempting to run optimize_threshold on a model, I get the error that 'CustomProbabilityThresholdClassifier' has no len(). I have reproduced this error on the latest version of pycaret and the master branch. In both cases, the virtual environment was only constructed from pip install pycaret or pip install -U git+https://github.com/pycaret/pycaret.git@master with no other package installs.
It appears to stem from the optimization function on line 2687 of pycaret/classification/oop.py:
model_results["model"] = model[0]
Somewhere down the line inside pandas, there is a check on whether model[0] has the same length as some index in the dataframe, but since model[0] is of type CustomProbabilityThresholdClassifier and has no len() function, the check raises an error.
Thank you for your assistance.
Reproducible Example
Expected Behavior
I would expect that this code would optimize the classification threshold with respect to accuracy and return the optimized value in threshold.
Actual Results
Installed Versions