[Bug]: blend_models() and stack_models() Fail with Certain Models Above 1000 Samples
Description:
PyCaret's blend_models() and stack_models() fail with certain models once the dataset exceeds 1000 samples. The error occurs during the finalization stage of model creation, raising "ValueError: cannot set WRITEABLE flag to True of this array". The issue does not appear to be memory-related: testing on systems with 8-24 cores and 32-128 GB of RAM did not resolve it.
Reproducible Code:
I've included a reproducible example below.
Affected Models:
The following models cause the error when included in blend_models() or stack_models():
lda, ridge, ada, dt, gbc, nb, et, rf, qda
All other models tested (including svm, lr, xgboost, lightgbm, knn, catboost, and dummy) work without issue when combined. Furthermore, blending or stacking a single dt model also triggers the error, which suggests the problem lies in how PyCaret handles these estimators during the finalization step.
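A quick way to confirm the failing set is a sweep like the one below. This is a hypothetical diagnostic sketch, not something PyCaret provides: it assumes an experiment exp already set up on a 1000+ row dataset, as in the reproducible example further down.

# Hypothetical sweep: blend each model by itself and record which ones fail.
# Assumes `exp` is a ClassificationExperiment set up on >= 1000 rows.
failing, passing = [], []
for model_id in ['lda', 'ridge', 'ada', 'dt', 'gbc', 'nb', 'et', 'rf', 'qda',
                 'svm', 'lr', 'knn', 'dummy']:
    model = exp.create_model(model_id, verbose=False)
    try:
        exp.blend_models(estimator_list=[model], verbose=False)
        passing.append(model_id)
    except ValueError:
        failing.append(model_id)
print(f"failing: {failing}")
print(f"passing: {passing}")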
Observations:
Finalization Error: The error appears when PyCaret is "Finalizing Model" during the blending or stacking process.
Manual Ensemble Success: Manually creating and fitting ensembles (using VotingClassifier and StackingClassifier) does not produce this error.
Dataset Size: The error only occurs when the dataset size exceeds 1000 rows.
Memory Profiler: Memory profiling (using memory_profiler) shows that usage stays below 400 MB for both 500 and 750 samples, with only a 4-5 MB increase between them, suggesting the problem is not memory-related (a sketch of the measurement follows this list).
Versions: The issue persists across PyCaret versions (including the master branch) and scikit-learn versions 1.4.1 and above.
Data Writability: The DataFrame is writable at all stages, before and after setup, model creation, and even after the error occurs.
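For reference, the memory check was along these lines. This is a minimal sketch only: it assumes memory_usage from the memory_profiler package and the exp/lr/dt/knn objects from the reproducible example below; the exact harness may have differed.

# Sketch of the memory check: sample process RSS while blending runs.
from memory_profiler import memory_usage

def run_blend():
    exp.blend_models(estimator_list=[lr, dt, knn])

peak = max(memory_usage((run_blend, (), {}), interval=0.1))
print(f"Peak memory during blending: {peak:.1f} MiB")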
This issue appears to be a bug in PyCaret's blend_models() and stack_models() functions. Any assistance in resolving this would be greatly appreciated.
Additional Context:
The error occurs even with a dataset of just 1000 rows, which is relatively small.
The issue appears to be related to the way PyCaret handles specific models during the finalization process of blending or stacking.
The problem persists across various versions of PyCaret and scikit-learn.
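The bottom of the worker traceback (under Actual Results below) fails inside sklearn's check_array on the line array.flags.writeable = True. That exact ValueError is what NumPy raises when an array is backed by a read-only buffer, e.g. a read-only memmap, which is suggestive because joblib's loky backend memmaps worker inputs above its max_nbytes threshold (1 MB by default). This is a hypothesis, not a confirmed root cause, but the error is easy to demonstrate in isolation, with no PyCaret involved:

# Standalone demonstration of the underlying NumPy error: an ndarray view
# over a read-only memmap cannot have its WRITEABLE flag set to True.
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "buf.dat")
np.random.rand(1000, 384).tofile(path)

mm = np.memmap(path, dtype=np.float64, mode="r")  # read-only, like loky's memmapped inputs
arr = np.asarray(mm)                              # plain ndarray view over the read-only buffer
try:
    arr.flags.writeable = True                    # same call sklearn's check_array makes
except ValueError as e:
    print(e)  # -> cannot set WRITEABLE flag to True of this array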
Reproducible Example
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, StackingClassifier
from pycaret.classification import ClassificationExperiment

# Generate synthetic data
train_data = pd.DataFrame(np.random.rand(750, 384))  # Works for 750 rows or fewer
# train_data = pd.DataFrame(np.random.rand(5000, 384))  # Fails for 1000 rows or more
train_data['target'] = np.random.randint(0, 2, len(train_data))

# Ensure a copy is made
train_data = train_data.copy(deep=True)

# Check data writability
print(f"DataFrame Flags Before Setup: {train_data.values.flags}")

exp = ClassificationExperiment()
exp.setup(data=train_data, target='target', session_id=123, fold=3, verbose=True)
print(f"DataFrame Flags After Setup: {train_data.values.flags}")

# Create models
lr = exp.create_model('lr')
dt = exp.create_model('dt')
knn = exp.create_model('knn')
print(f"DataFrame Flags After Creating Models: {train_data.values.flags}")

# Attempt to blend models (this is where the error occurs)
try:
    blended_model = exp.blend_models(estimator_list=[lr, dt, knn])
    print(f"DataFrame Flags After Blending: {train_data.values.flags}")
except ValueError as e:
    print(f"Error: {e}")
    print(f"DataFrame Flags After Blending Error: {train_data.values.flags}")

# Attempt to stack models (this also triggers the error)
try:
    stacked_model = exp.stack_models(estimator_list=[lr, dt, knn], meta_model=LogisticRegression())
    print(f"DataFrame Flags After Stacking: {train_data.values.flags}")
except ValueError as e:
    print(f"Error: {e}")
    print(f"DataFrame Flags After Stacking Error: {train_data.values.flags}")

# --- Optional: manually create and fit the same ensembles to show no issue ---
voting_clf = VotingClassifier(
    estimators=[('lr', lr), ('dt', dt), ('knn', knn)],
    voting='hard',
)
voting_clf.fit(train_data.drop('target', axis=1), train_data['target'])

stacked_clf = StackingClassifier(
    estimators=[('lr', lr), ('dt', dt), ('knn', knn)],
    final_estimator=LogisticRegression(),
)
stacked_clf.fit(train_data.drop('target', axis=1), train_data['target'])
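If the read-only memmap hypothesis above holds, keeping the ensemble fit in a single process should avoid the error. The following is an untested workaround sketch: n_jobs is a real setup() parameter, but whether it sidesteps this particular bug is an assumption.

# Possible workaround to test: n_jobs=1 keeps estimator fitting in-process,
# so joblib never memmaps the training data for worker processes.
exp = ClassificationExperiment()
exp.setup(data=train_data, target='target', session_id=123, fold=3, n_jobs=1)
blended_model = exp.blend_models(estimator_list=[exp.create_model('dt')])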
Expected Behavior
No Errors
Actual Results
Initiated . . . . . . . . . . . . . . . . . . 18:44:38
Status . . . . . . . . . . . . . . . . . . Compiling Estimators
Estimator . . . . . . . . . . . . . . . . . . Voting Classifier
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
r = call_item()
^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
return self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py", line 589, in __call__
return [func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py", line 589, in <listcomp>
return [func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 129, in __call__
return self.function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_base.py", line 36, in _fit_single_estimator
estimator.fit(X, y)
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/tree/_classes.py", line 1009, in fit
super()._fit(
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/tree/_classes.py", line 252, in _fit
X, y = self._validate_data(
^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py", line 645, in _validate_data
X = check_array(X, input_name="X", **check_X_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1097, in check_array
array.flags.writeable = True
^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Cell In[10], line 1
----> 1 blended_model = exp.blend_models(estimator_list=[dt])
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/classification/oop.py:1803, in ClassificationExperiment.blend_models(self, estimator_list, fold, round, choose_better, optimize, method, weights, fit_kwargs, groups, probability_threshold, verbose, return_train_score)
1699 def blend_models(
1700 self,
1701 estimator_list: list,
(...)
1712 return_train_score: bool = False,
1713 ) -> Any:
1714 """
1715 This function trains a Soft Voting / Majority Rule classifier for select
1716 models passed in the ``estimator_list`` param. The output of this function
(...)
1800
1801 """
-> 1803 return super().blend_models(
1804 estimator_list=estimator_list,
1805 fold=fold,
1806 round=round,
1807 choose_better=choose_better,
1808 optimize=optimize,
1809 method=method,
1810 weights=weights,
1811 fit_kwargs=fit_kwargs,
1812 groups=groups,
1813 verbose=verbose,
1814 probability_threshold=probability_threshold,
1815 return_train_score=return_train_score,
1816 )
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:3486, in _SupervisedExperiment.blend_models(self, estimator_list, fold, round, choose_better, optimize, method, weights, fit_kwargs, groups, probability_threshold, verbose, return_train_score)
3481 display.move_progress()
3483 self.logger.info(
3484 "SubProcess create_model() called =================================="
3485 )
-> 3486 model, model_fit_time = self._create_model(
3487 estimator=model,
3488 system=False,
3489 display=display,
3490 fold=fold,
3491 round=round,
3492 fit_kwargs=fit_kwargs,
3493 groups=groups,
3494 probability_threshold=probability_threshold,
3495 return_train_score=return_train_score,
3496 )
3498 model_results = self.pull()
3499 self.logger.info(
3500 "SubProcess create_model() end =================================="
3501 )
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:1533, in _SupervisedExperiment._create_model(self, estimator, fold, round, cross_validation, predict, fit_kwargs, groups, refit, probability_threshold, experiment_custom_tags, verbose, system, add_to_model_list, X_train_data, y_train_data, metrics, display, model_only, return_train_score, error_score, **kwargs)
1530 return model, model_fit_time
1531 return model
-> 1533 model, model_fit_time, model_results, _ = self._create_model_with_cv(
1534 model=model,
1535 data_X=data_X,
1536 data_y=data_y,
1537 fit_kwargs=fit_kwargs,
1538 round=round,
1539 cv=cv,
1540 groups=groups,
1541 metrics=metrics,
1542 refit=refit,
1543 system=system,
1544 display=display,
1545 error_score=error_score,
1546 return_train_score=return_train_score,
1547 )
1549 # end runtime
1550 runtime_end = time.time()
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:1223, in _SupervisedExperiment._create_model_with_cv(self, model, data_X, data_y, fit_kwargs, round, cv, groups, metrics, refit, system, display, error_score, return_train_score)
1221 self.logger.info("Finalizing model")
1222 with redirect_output(self.logger):
-> 1223 pipeline_with_model.fit(data_X, data_y, **fit_kwargs)
1224 model_fit_end = time.time()
1226 # calculating metrics on predictions of complete train dataset
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pipeline.py:278, in Pipeline.fit(self, X, y, **params)
276 if self._final_estimator != "passthrough":
277 last_step_params = routed_params[self.steps[-1][0]]
--> 278 fitted_estimator = self._memory_fit(
279 clone(self.steps[-1][1]), X, y, **last_step_params["fit"]
280 )
281 # Hacky way to make sure that the state of the estimator
282 # loaded from cache is carried over to the estimator
283 # in steps
284 _copy_estimator_state(fitted_estimator, self.steps[-1][1])
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/memory.py:353, in NotMemorizedFunc.__call__(self, *args, **kwargs)
352 def __call__(self, *args, **kwargs):
--> 353 return self.func(*args, **kwargs)
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pipeline.py:69, in _fit_one(transformer, X, y, message, params)
67 if "y" in signature(transformer.fit).parameters:
68 args.append(y)
---> 69 transformer.fit(*args)
70 return transformer
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1467 estimator._validate_params()
1469 with config_context(
1470 skip_parameter_validation=(
1471 prefer_skip_nested_validation or global_skip_validation
1472 )
1473 ):
-> 1474 return fit_method(estimator, *args, **kwargs)
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_voting.py:366, in VotingClassifier.fit(self, X, y, sample_weight)
363 self.classes_ = self.le_.classes_
364 transformed_y = self.le_.transform(y)
--> 366 return super().fit(X, transformed_y, sample_weight)
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_voting.py:89, in _BaseVoting.fit(self, X, y, sample_weight)
83 if self.weights is not None and len(self.weights) != len(self.estimators):
84 raise ValueError(
85 "Number of `estimators` and weights must be equal; got"
86 f" {len(self.weights)} weights, {len(self.estimators)} estimators"
87 )
---> 89 self.estimators_ = Parallel(n_jobs=self.n_jobs)(
90 delayed(_fit_single_estimator)(
91 clone(clf),
92 X,
93 y,
94 sample_weight=sample_weight,
95 message_clsname="Voting",
96 message=self._log_message(names[idx], idx + 1, len(clfs)),
97 )
98 for idx, clf in enumerate(clfs)
99 if clf != "drop"
100 )
102 self.named_estimators_ = Bunch()
104 # Uses 'drop' as placeholder for dropped estimators
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/parallel.py:67, in Parallel.__call__(self, iterable)
62 config = get_config()
63 iterable_with_config = (
64 (_with_config(delayed_func, config), args, kwargs)
65 for delayed_func, args, kwargs in iterable
66 )
---> 67 return super().__call__(iterable_with_config)
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1952, in Parallel.__call__(self, iterable)
1946 # The first item from the output is blank, but it makes the interpreter
1947 # progress until it enters the Try/Except block of the generator and
1948 # reach the first `yield` statement. This starts the aynchronous
1949 # dispatch of the tasks to the workers.
1950 next(output)
-> 1952 return output if self.return_generator else list(output)
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1595, in Parallel._get_outputs(self, iterator, pre_dispatch)
1592 yield
1594 with self._backend.retrieval_context():
-> 1595 yield from self._retrieve()
1597 except GeneratorExit:
1598 # The generator has been garbage collected before being fully
1599 # consumed. This aborts the remaining tasks if possible and warn
1600 # the user if necessary.
1601 self._exception = True
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1699, in Parallel._retrieve(self)
1692 while self._wait_retrieval():
1693
1694 # If the callback thread of a worker has signaled that its task
1695 # triggered an exception, or if the retrieval loop has raised an
1696 # exception (e.g. `GeneratorExit`), exit the loop and surface the
1697 # worker traceback.
1698 if self._aborting:
-> 1699 self._raise_error_fast()
1700 break
1702 # If the next job is not ready for retrieval yet, we just wait for
1703 # async callbacks to progress.
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1734, in Parallel._raise_error_fast(self)
1730 # If this error job exists, immediatly raise the error by
1731 # calling get_result. This job might not exists if abort has been
1732 # called directly or if the generator is gc'ed.
1733 if error_job is not None:
-> 1734 error_job.get_result(self.timeout)
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:736, in BatchCompletionCallBack.get_result(self, timeout)
730 backend = self.parallel._backend
732 if backend.supports_retrieve_callback:
733 # We assume that the result has already been retrieved by the
734 # callback thread, and is stored internally. It's just waiting to
735 # be returned.
--> 736 return self._return_or_raise()
738 # For other backends, the main thread needs to run the retrieval step.
739 try:
File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:754, in BatchCompletionCallBack._return_or_raise(self)
752 try:
753 if self.status == TASK_ERROR:
--> 754 raise self._result
755 return self._result
756 finally:
ValueError: cannot set WRITEABLE flag to True of this array
pycaret version checks
[X] I have checked that this issue has not already been reported here.
[X] I have confirmed this bug exists on the latest version of pycaret.
[X] I have confirmed this bug exists on the master branch of pycaret (pip install -U git+https://github.com/pycaret/pycaret.git@master).
Installed Versions
System:
    python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
    executable: /home/cmobley/mambaforge/envs/project_1/bin/python
    machine: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
PyCaret required dependencies:
    pip: 24.0
    setuptools: 70.0.0
    pycaret: 3.3.2
    IPython: 8.24.0
    ipywidgets: 8.1.2
    tqdm: 4.66.4
    numpy: 1.26.4
    pandas: 2.1.4
    jinja2: 3.1.4
    scipy: 1.11.4
    joblib: 1.3.2
    sklearn: 1.4.2
    pyod: 1.1.3
    imblearn: 0.12.2
    category_encoders: 2.6.3
    lightgbm: 4.3.0
    numba: 0.59.1
    requests: 2.32.2
    matplotlib: 3.7.5
    scikitplot: 0.3.7
    yellowbrick: 1.5
    plotly: 5.22.0
    plotly-resampler: Not installed
    kaleido: 0.2.1
    schemdraw: 0.15
    statsmodels: 0.14.2
    sktime: 0.26.0
    tbats: 1.1.3
    pmdarima: 2.0.4
    psutil: 5.9.8
    markupsafe: 2.1.5
    pickle5: Not installed
    cloudpickle: 3.0.0
    deprecation: 2.1.0
    xxhash: 3.4.1
    wurlitzer: 3.1.0
PyCaret optional dependencies:
    shap: 0.44.1
    interpret: 0.6.1
    umap: 0.5.6
    ydata_profiling: 4.8.3
    explainerdashboard: 0.4.7
    autoviz: Not installed
    fairlearn: 0.7.0
    deepchecks: Not installed
    xgboost: 2.0.3
    catboost: 1.2.5
    kmodes: 0.12.2
    mlxtend: 0.23.1
    statsforecast: 1.5.0
    tune_sklearn: Not installed
    ray: Not installed
    hyperopt: 0.2.7
    optuna: 3.6.0
    skopt: 0.10.1
    mlflow: 2.13.0
    gradio: 4.31.5
    fastapi: 0.111.0
    uvicorn: 0.29.0
    m2cgen: 0.10.0
    evidently: 0.4.25
    fugue: 0.8.7
    streamlit: Not installed
    prophet: Not installed