[BUG]: blend_models() and stack_models() Fail with Certain Models Above 1000 Samples

pycaret version checks

[X] I have checked that this issue has not already been reported here.
[X] I have confirmed this bug exists on the latest version of pycaret.
[X] I have confirmed this bug exists on the master branch of pycaret (pip install -U git+https://github.com/pycaret/pycaret.git@master).

Issue Description


Description:

PyCaret's blend_models() and stack_models() fail for certain models once the dataset reaches about 1000 rows (750 rows works, 1000 rows fails). The error is raised during the finalization stage of model creation: "ValueError: cannot set WRITEABLE flag to True of this array". It does not appear to be caused by memory limits, since testing on systems with 8-24 cores and 32-128 GB of RAM did not resolve it.
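For context, the same message can be produced with NumPy alone. The sketch below is not the PyCaret code path; it only assumes that the failing array sits on memory NumPy cannot make writable (here, an immutable bytes object stands in for that memory). The traceback further down shows scikit-learn's check_array attempting the same flag assignment inside a joblib worker.

```python
import numpy as np

# An array built on an immutable bytes object is read-only, and NumPy
# refuses to flip the flag back because the underlying buffer is not writable.
read_only = np.frombuffer(bytes(8 * 4), dtype=np.float64)
print(read_only.flags.writeable)

message = None
try:
    read_only.flags.writeable = True
except ValueError as exc:
    message = str(exc)
    print(f"ValueError: {message}")

# A copy owns fresh memory, so it is writable again.
writable_copy = read_only.copy()
writable_copy[0] = 1.0
print(writable_copy.flags.writeable)
```

This would also explain why the DataFrame in the parent process can report writable flags throughout: the array that fails is the one seen by the worker, not the one inspected in the notebook.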

Reproducible Code: I've included a reproducible example below.

Affected Models:

The following models cause the error when included in blend_models() or stack_models():

lda, ridge, ada, dt, gbc, nb, et, rf, qda

All other models tested (including svm, lr, xgboost, lightgbm, knn, catboost, and dummy) work without issue when combined. Furthermore, blending or stacking using a single dt model also causes the error. This indicates the issue is likely related to how PyCaret handles models like these during the finalization step.

Observations:

Finalization Error: The error appears when PyCaret is "Finalizing Model" during the blending or stacking process.
Manual Ensemble Success: Manually creating and fitting ensembles (using VotingClassifier and StackingClassifier) does not produce this error.
Dataset Size: The error only occurs once the dataset reaches 1000 rows; 750 rows or fewer works.
Memory Profiler: Memory profiling (using memory_profiler) indicates that memory usage remains below 400 MB for both 500 and 750 samples, with only a 4-5 MB increase, suggesting that the problem is not memory-related.
Versions: The issue persists across PyCaret versions (including master branch) and scikit-learn versions 1.4.1 and above.
Data Writability: The DataFrame is writable at all stages, before and after setup, model creation, and even after the error occurs.
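One possible explanation for the row threshold, offered as a hypothesis rather than a confirmed diagnosis: joblib's Parallel has a max_nbytes setting (default "1M") above which arrays passed to worker processes are memory-mapped, and those memory maps are opened read-only by default. The arithmetic below assumes PyCaret's default train_size of 0.7 and assumes the features are held as float32 after setup; neither was verified against this report. The helper name train_block_nbytes is made up for illustration.

```python
# joblib.Parallel's default max_nbytes is "1M", which joblib reads as 1024**2 bytes.
JOBLIB_DEFAULT_MAX_NBYTES = 1024 ** 2


def train_block_nbytes(n_rows, n_features=384, itemsize=4, train_size=0.7):
    """Approximate size in bytes of the training feature block.

    itemsize=4 assumes float32 storage and train_size=0.7 assumes the
    default train/test split; both are assumptions, not measured values.
    """
    train_rows = round(n_rows * train_size)
    return train_rows * n_features * itemsize


for n_rows in (500, 750, 1000, 5000):
    nbytes = train_block_nbytes(n_rows)
    over = nbytes > JOBLIB_DEFAULT_MAX_NBYTES
    print(f"{n_rows} rows: {nbytes} bytes, above threshold: {over}")
```

Under these assumptions the 750-row case stays below the threshold and the 1000-row case crosses it, which matches the observed behavior. If the hypothesis holds, the cutoff would depend on rows times features rather than on the row count alone, so repeating the test with a different feature count would be a quick way to check it.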

References:

Chinese Blog Mentioning Similar Issue

Request:

This issue appears to be a bug in PyCaret's blend_models() and stack_models() functions. Any assistance in resolving this would be greatly appreciated.

Additional Context:

The error occurs even with a dataset of just 1000 rows, which is relatively small.
The issue appears to be related to the way PyCaret handles specific models during the finalization process of blending or stacking.
The problem persists across various versions of PyCaret and scikit-learn.

Reproducible Example

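import pandas as pd
import numpy as np
from pycaret.classification import ClassificationExperiment
# Needed for the meta-model and the manual ensembles further down
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression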

# Generate Synthetic Data
train_data = pd.DataFrame(np.random.rand(750, 384))  # Works for 750 or less
# train_data = pd.DataFrame(np.random.rand(5000, 384))  # Fails for 1000 or more
train_data['target'] = np.random.randint(0, 2, len(train_data))

# Ensure a copy is made
train_data = train_data.copy(deep=True) 

# Check Data Writability
print(f"DataFrame Flags Before Setup: {train_data.values.flags}")

exp = ClassificationExperiment()
exp.setup(data=train_data, target='target', session_id=123, fold=3, verbose=True)

print(f"DataFrame Flags After Setup: {train_data.values.flags}")

# Create Models
lr = exp.create_model('lr')
dt = exp.create_model('dt')
knn = exp.create_model('knn')

print(f"DataFrame Flags After Creating Models: {train_data.values.flags}")

# Attempt to blend models (this is where the error likely occurs)
try:
    blended_model = exp.blend_models(estimator_list=[lr, dt, knn])

    # Check DataFrame flags after blending
    print(f"DataFrame Flags After Blending: {train_data.values.flags}")
except ValueError as e:
    print(f"Error: {e}")
    print(f"DataFrame Flags After Blending Error: {train_data.values.flags}")

# Attempt to stack models (this is where the error likely occurs)
try:
    stacked_model = exp.stack_models(estimator_list=[lr, dt, knn], meta_model=LogisticRegression())

    # Check DataFrame flags after stacking
    print(f"DataFrame Flags After Stacking: {train_data.values.flags}")
except ValueError as e:
    print(f"Error: {e}")
    print(f"DataFrame Flags After Stacking Error: {train_data.values.flags}")

# --- (Optional: Manually create and fit models to show no issue) --- 
voting_clf = VotingClassifier(
    estimators=[('lr', lr), ('dt', dt), ('knn', knn)], 
    voting='hard' 
)
voting_clf.fit(train_data.drop('target', axis=1), train_data['target'])

estimators = [
    ('lr', lr),
    ('dt', dt),
    ('knn', knn)
]
stacked_clf = StackingClassifier(
    estimators=estimators, 
    final_estimator=LogisticRegression()
)
stacked_clf.fit(train_data.drop('target', axis=1), train_data['target'])

Expected Behavior

No Errors

Actual Results

Initiated   . . . . . . . . . . . . . . . . . . 18:44:38
Status  . . . . . . . . . . . . . . . . . . Compiling Estimators
Estimator   . . . . . . . . . . . . . . . . . . Voting Classifier
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_base.py", line 36, in _fit_single_estimator
    estimator.fit(X, y)
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/tree/_classes.py", line 1009, in fit
    super()._fit(
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/tree/_classes.py", line 252, in _fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py", line 645, in _validate_data
    X = check_array(X, input_name="X", **check_X_params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1097, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 blended_model = exp.blend_models(estimator_list=[dt])

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/classification/oop.py:1803, in ClassificationExperiment.blend_models(self, estimator_list, fold, round, choose_better, optimize, method, weights, fit_kwargs, groups, probability_threshold, verbose, return_train_score)
   1699 def blend_models(
   1700     self,
   1701     estimator_list: list,
   (...)
   1712     return_train_score: bool = False,
   1713 ) -> Any:
   1714     """
   1715     This function trains a Soft Voting / Majority Rule classifier for select
   1716     models passed in the ``estimator_list`` param. The output of this function
   (...)
   1800 
   1801     """
-> 1803     return super().blend_models(
   1804         estimator_list=estimator_list,
   1805         fold=fold,
   1806         round=round,
   1807         choose_better=choose_better,
   1808         optimize=optimize,
   1809         method=method,
   1810         weights=weights,
   1811         fit_kwargs=fit_kwargs,
   1812         groups=groups,
   1813         verbose=verbose,
   1814         probability_threshold=probability_threshold,
   1815         return_train_score=return_train_score,
   1816     )

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:3486, in _SupervisedExperiment.blend_models(self, estimator_list, fold, round, choose_better, optimize, method, weights, fit_kwargs, groups, probability_threshold, verbose, return_train_score)
   3481 display.move_progress()
   3483 self.logger.info(
   3484     "SubProcess create_model() called =================================="
   3485 )
-> 3486 model, model_fit_time = self._create_model(
   3487     estimator=model,
   3488     system=False,
   3489     display=display,
   3490     fold=fold,
   3491     round=round,
   3492     fit_kwargs=fit_kwargs,
   3493     groups=groups,
   3494     probability_threshold=probability_threshold,
   3495     return_train_score=return_train_score,
   3496 )
   3498 model_results = self.pull()
   3499 self.logger.info(
   3500     "SubProcess create_model() end =================================="
   3501 )

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:1533, in _SupervisedExperiment._create_model(self, estimator, fold, round, cross_validation, predict, fit_kwargs, groups, refit, probability_threshold, experiment_custom_tags, verbose, system, add_to_model_list, X_train_data, y_train_data, metrics, display, model_only, return_train_score, error_score, **kwargs)
   1530         return model, model_fit_time
   1531     return model
-> 1533 model, model_fit_time, model_results, _ = self._create_model_with_cv(
   1534     model=model,
   1535     data_X=data_X,
   1536     data_y=data_y,
   1537     fit_kwargs=fit_kwargs,
   1538     round=round,
   1539     cv=cv,
   1540     groups=groups,
   1541     metrics=metrics,
   1542     refit=refit,
   1543     system=system,
   1544     display=display,
   1545     error_score=error_score,
   1546     return_train_score=return_train_score,
   1547 )
   1549 # end runtime
   1550 runtime_end = time.time()

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:1223, in _SupervisedExperiment._create_model_with_cv(self, model, data_X, data_y, fit_kwargs, round, cv, groups, metrics, refit, system, display, error_score, return_train_score)
   1221 self.logger.info("Finalizing model")
   1222 with redirect_output(self.logger):
-> 1223     pipeline_with_model.fit(data_X, data_y, **fit_kwargs)
   1224     model_fit_end = time.time()
   1226 # calculating metrics on predictions of complete train dataset

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pipeline.py:278, in Pipeline.fit(self, X, y, **params)
    276 if self._final_estimator != "passthrough":
    277     last_step_params = routed_params[self.steps[-1][0]]
--> 278     fitted_estimator = self._memory_fit(
    279         clone(self.steps[-1][1]), X, y, **last_step_params["fit"]
    280     )
    281     # Hacky way to make sure that the state of the estimator
    282     # loaded from cache is carried over to the estimator
    283     # in steps
    284     _copy_estimator_state(fitted_estimator, self.steps[-1][1])

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/memory.py:353, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    352 def __call__(self, *args, **kwargs):
--> 353     return self.func(*args, **kwargs)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pipeline.py:69, in _fit_one(transformer, X, y, message, params)
     67         if "y" in signature(transformer.fit).parameters:
     68             args.append(y)
---> 69         transformer.fit(*args)
     70 return transformer

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1467     estimator._validate_params()
   1469 with config_context(
   1470     skip_parameter_validation=(
   1471         prefer_skip_nested_validation or global_skip_validation
   1472     )
   1473 ):
-> 1474     return fit_method(estimator, *args, **kwargs)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_voting.py:366, in VotingClassifier.fit(self, X, y, sample_weight)
    363 self.classes_ = self.le_.classes_
    364 transformed_y = self.le_.transform(y)
--> 366 return super().fit(X, transformed_y, sample_weight)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_voting.py:89, in _BaseVoting.fit(self, X, y, sample_weight)
     83 if self.weights is not None and len(self.weights) != len(self.estimators):
     84     raise ValueError(
     85         "Number of `estimators` and weights must be equal; got"
     86         f" {len(self.weights)} weights, {len(self.estimators)} estimators"
     87     )
---> 89 self.estimators_ = Parallel(n_jobs=self.n_jobs)(
     90     delayed(_fit_single_estimator)(
     91         clone(clf),
     92         X,
     93         y,
     94         sample_weight=sample_weight,
     95         message_clsname="Voting",
     96         message=self._log_message(names[idx], idx + 1, len(clfs)),
     97     )
     98     for idx, clf in enumerate(clfs)
     99     if clf != "drop"
    100 )
    102 self.named_estimators_ = Bunch()
    104 # Uses 'drop' as placeholder for dropped estimators

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/parallel.py:67, in Parallel.__call__(self, iterable)
     62 config = get_config()
     63 iterable_with_config = (
     64     (_with_config(delayed_func, config), args, kwargs)
     65     for delayed_func, args, kwargs in iterable
     66 )
---> 67 return super().__call__(iterable_with_config)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1952, in Parallel.__call__(self, iterable)
   1946 # The first item from the output is blank, but it makes the interpreter
   1947 # progress until it enters the Try/Except block of the generator and
   1948 # reach the first `yield` statement. This starts the aynchronous
   1949 # dispatch of the tasks to the workers.
   1950 next(output)
-> 1952 return output if self.return_generator else list(output)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1595, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1592     yield
   1594     with self._backend.retrieval_context():
-> 1595         yield from self._retrieve()
   1597 except GeneratorExit:
   1598     # The generator has been garbage collected before being fully
   1599     # consumed. This aborts the remaining tasks if possible and warn
   1600     # the user if necessary.
   1601     self._exception = True

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1699, in Parallel._retrieve(self)
   1692 while self._wait_retrieval():
   1693 
   1694     # If the callback thread of a worker has signaled that its task
   1695     # triggered an exception, or if the retrieval loop has raised an
   1696     # exception (e.g. `GeneratorExit`), exit the loop and surface the
   1697     # worker traceback.
   1698     if self._aborting:
-> 1699         self._raise_error_fast()
   1700         break
   1702     # If the next job is not ready for retrieval yet, we just wait for
   1703     # async callbacks to progress.

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1734, in Parallel._raise_error_fast(self)
   1730 # If this error job exists, immediatly raise the error by
   1731 # calling get_result. This job might not exists if abort has been
   1732 # called directly or if the generator is gc'ed.
   1733 if error_job is not None:
-> 1734     error_job.get_result(self.timeout)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:736, in BatchCompletionCallBack.get_result(self, timeout)
    730 backend = self.parallel._backend
    732 if backend.supports_retrieve_callback:
    733     # We assume that the result has already been retrieved by the
    734     # callback thread, and is stored internally. It's just waiting to
    735     # be returned.
--> 736     return self._return_or_raise()
    738 # For other backends, the main thread needs to run the retrieval step.
    739 try:

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:754, in BatchCompletionCallBack._return_or_raise(self)
    752 try:
    753     if self.status == TASK_ERROR:
--> 754         raise self._result
    755     return self._result
    756 finally:

ValueError: cannot set WRITEABLE flag to True of this array

Installed Versions

System:
python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
executable: /home/cmobley/mambaforge/envs/project_1/bin/python
machine: Linux-6.5.0-35-generic-x86_64-with-glibc2.35

PyCaret required dependencies:
pip: 24.0
setuptools: 70.0.0
pycaret: 3.3.2
IPython: 8.24.0
ipywidgets: 8.1.2
tqdm: 4.66.4
numpy: 1.26.4
pandas: 2.1.4
jinja2: 3.1.4
scipy: 1.11.4
joblib: 1.3.2
sklearn: 1.4.2
pyod: 1.1.3
imblearn: 0.12.2
category_encoders: 2.6.3
lightgbm: 4.3.0
numba: 0.59.1
requests: 2.32.2
matplotlib: 3.7.5
scikitplot: 0.3.7
yellowbrick: 1.5
plotly: 5.22.0
plotly-resampler: Not installed
kaleido: 0.2.1
schemdraw: 0.15
statsmodels: 0.14.2
sktime: 0.26.0
tbats: 1.1.3
pmdarima: 2.0.4
psutil: 5.9.8
markupsafe: 2.1.5
pickle5: Not installed
cloudpickle: 3.0.0
deprecation: 2.1.0
xxhash: 3.4.1
wurlitzer: 3.1.0

PyCaret optional dependencies:
shap: 0.44.1
interpret: 0.6.1
umap: 0.5.6
ydata_profiling: 4.8.3
explainerdashboard: 0.4.7
autoviz: Not installed
fairlearn: 0.7.0
deepchecks: Not installed
xgboost: 2.0.3
catboost: 1.2.5
kmodes: 0.12.2
mlxtend: 0.23.1
statsforecast: 1.5.0
tune_sklearn: Not installed
ray: Not installed
hyperopt: 0.2.7
optuna: 3.6.0
skopt: 0.10.1
mlflow: 2.13.0
gradio: 4.31.5
fastapi: 0.111.0
uvicorn: 0.29.0
m2cgen: 0.10.0
evidently: 0.4.25
fugue: 0.8.7
streamlit: Not installed
prophet: Not installed
