pycaret / pycaret

An open-source, low-code machine learning library in Python
https://www.pycaret.org
MIT License

[BUG]: blend_models() and stack_models() Fail with Certain Models Above 1000 Samples #3999

Open CMobley7 opened 5 months ago

CMobley7 commented 5 months ago

pycaret version checks

Issue Description

Description:

PyCaret's blend_models() and stack_models() functions fail when the estimator list includes certain models and the dataset reaches roughly 1000 samples or more. The error is raised during the finalization stage of model creation and reads "ValueError: cannot set WRITEABLE flag to True of this array". It does not appear to be caused by memory limitations: the failure reproduced on systems with 8-24 cores and 32-128 GB of RAM.
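One possible explanation for the sample-count threshold, not confirmed anywhere in this report, is joblib's habit of memory-mapping large arguments before sending them to worker processes. The back-of-the-envelope sketch below rests on three assumptions: joblib's default `max_nbytes` of "1M" (1 MiB), a 70% training split, and features stored as float32. The constant and function names are hypothetical and exist only for this illustration.

```python
# Rough size check under stated assumptions (none confirmed by this report):
#   - joblib memory-maps arguments larger than 1 MiB (default max_nbytes="1M")
#   - the final fit uses a 70% training split of the input rows
#   - feature values are stored as 4-byte float32
JOBLIB_MAX_NBYTES = 1024 ** 2
BYTES_PER_VALUE = 4
N_FEATURES = 384  # matches the synthetic data in the reproducible example


def train_matrix_bytes(n_samples: int) -> int:
    """Approximate size in bytes of the training feature matrix."""
    train_rows = n_samples * 7 // 10  # integer form of a 70% split
    return train_rows * N_FEATURES * BYTES_PER_VALUE


for n_samples in (750, 1000, 5000):
    size = train_matrix_bytes(n_samples)
    print(f"{n_samples} samples: {size} bytes, above threshold: {size > JOBLIB_MAX_NBYTES}")
```

Under these assumptions the 750-sample case stays below the threshold and the 1000- and 5000-sample cases exceed it, which would be consistent with the observed behavior but does not prove the cause.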

Reproducible Code: I've included a reproducible example below.

Affected Models:

The following models cause the error when included in blend_models() or stack_models():

All other models tested (including svm, lr, xgboost, lightgbm, knn, catboost, and dummy) work without issue when combined. Furthermore, blending or stacking using a single dt model also causes the error. This indicates the issue is likely related to how PyCaret handles models like these during the finalization step.

Observations:

References:

Request:

This issue appears to be a bug in PyCaret's blend_models() and stack_models() functions. Any assistance in resolving this would be greatly appreciated.

Additional Context:

Reproducible Example

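import pandas as pd
import numpy as np
from pycaret.classification import ClassificationExperiment
# Needed by the stacking call and the manual scikit-learn comparison below
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression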

# Generate Synthetic Data
train_data = pd.DataFrame(np.random.rand(750, 384))  # Works for 750 or less
# train_data = pd.DataFrame(np.random.rand(5000, 384))  # Fails for 1000 or more
train_data['target'] = np.random.randint(0, 2, len(train_data))

# Ensure a copy is made
train_data = train_data.copy(deep=True) 

# Check Data Writability
print(f"DataFrame Flags Before Setup: {train_data.values.flags}")

exp = ClassificationExperiment()
exp.setup(data=train_data, target='target', session_id=123, fold=3, verbose=True)

print(f"DataFrame Flags After Setup: {train_data.values.flags}")

# Create Models
lr = exp.create_model('lr')
dt = exp.create_model('dt')
knn = exp.create_model('knn')

print(f"DataFrame Flags After Creating Models: {train_data.values.flags}")

# Attempt to blend models (this is where the error likely occurs)
try:
    blended_model = exp.blend_models(estimator_list=[lr, dt, knn])

    # Check DataFrame flags after blending
    print(f"DataFrame Flags After Blending: {train_data.values.flags}")
except ValueError as e:
    print(f"Error: {e}")
    print(f"DataFrame Flags After Blending Error: {train_data.values.flags}")

# Attempt to stack models (this is where the error likely occurs)
try:
    stacked_model = exp.stack_models(estimator_list=[lr, dt, knn], meta_model=LogisticRegression())

    # Check DataFrame flags after stacking
    print(f"DataFrame Flags After Stacking: {train_data.values.flags}")
except ValueError as e:
    print(f"Error: {e}")
    print(f"DataFrame Flags After Stacking Error: {train_data.values.flags}")

# --- (Optional: Manually create and fit models to show no issue) --- 
voting_clf = VotingClassifier(
    estimators=[('lr', lr), ('dt', dt), ('knn', knn)], 
    voting='hard' 
)
voting_clf.fit(train_data.drop('target', axis=1), train_data['target'])

estimators = [
    ('lr', lr),
    ('dt', dt),
    ('knn', knn)
]
stacked_clf = StackingClassifier(
    estimators=estimators, 
    final_estimator=LogisticRegression()
)
stacked_clf.fit(train_data.drop('target', axis=1), train_data['target'])

Expected Behavior

No Errors

Actual Results

Initiated   . . . . . . . . . . . . . . . . . . 18:44:38
Status  . . . . . . . . . . . . . . . . . . Compiling Estimators
Estimator   . . . . . . . . . . . . . . . . . . Voting Classifier
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_base.py", line 36, in _fit_single_estimator
    estimator.fit(X, y)
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/tree/_classes.py", line 1009, in fit
    super()._fit(
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/tree/_classes.py", line 252, in _fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py", line 645, in _validate_data
    X = check_array(X, input_name="X", **check_X_params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmobley/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1097, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 blended_model = exp.blend_models(estimator_list=[dt])

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/classification/oop.py:1803, in ClassificationExperiment.blend_models(self, estimator_list, fold, round, choose_better, optimize, method, weights, fit_kwargs, groups, probability_threshold, verbose, return_train_score)
   1699 def blend_models(
   1700     self,
   1701     estimator_list: list,
   (...)
   1712     return_train_score: bool = False,
   1713 ) -> Any:
   1714     """
   1715     This function trains a Soft Voting / Majority Rule classifier for select
   1716     models passed in the ``estimator_list`` param. The output of this function
   (...)
   1800 
   1801     """
-> 1803     return super().blend_models(
   1804         estimator_list=estimator_list,
   1805         fold=fold,
   1806         round=round,
   1807         choose_better=choose_better,
   1808         optimize=optimize,
   1809         method=method,
   1810         weights=weights,
   1811         fit_kwargs=fit_kwargs,
   1812         groups=groups,
   1813         verbose=verbose,
   1814         probability_threshold=probability_threshold,
   1815         return_train_score=return_train_score,
   1816     )

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:3486, in _SupervisedExperiment.blend_models(self, estimator_list, fold, round, choose_better, optimize, method, weights, fit_kwargs, groups, probability_threshold, verbose, return_train_score)
   3481 display.move_progress()
   3483 self.logger.info(
   3484     "SubProcess create_model() called =================================="
   3485 )
-> 3486 model, model_fit_time = self._create_model(
   3487     estimator=model,
   3488     system=False,
   3489     display=display,
   3490     fold=fold,
   3491     round=round,
   3492     fit_kwargs=fit_kwargs,
   3493     groups=groups,
   3494     probability_threshold=probability_threshold,
   3495     return_train_score=return_train_score,
   3496 )
   3498 model_results = self.pull()
   3499 self.logger.info(
   3500     "SubProcess create_model() end =================================="
   3501 )

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:1533, in _SupervisedExperiment._create_model(self, estimator, fold, round, cross_validation, predict, fit_kwargs, groups, refit, probability_threshold, experiment_custom_tags, verbose, system, add_to_model_list, X_train_data, y_train_data, metrics, display, model_only, return_train_score, error_score, **kwargs)
   1530         return model, model_fit_time
   1531     return model
-> 1533 model, model_fit_time, model_results, _ = self._create_model_with_cv(
   1534     model=model,
   1535     data_X=data_X,
   1536     data_y=data_y,
   1537     fit_kwargs=fit_kwargs,
   1538     round=round,
   1539     cv=cv,
   1540     groups=groups,
   1541     metrics=metrics,
   1542     refit=refit,
   1543     system=system,
   1544     display=display,
   1545     error_score=error_score,
   1546     return_train_score=return_train_score,
   1547 )
   1549 # end runtime
   1550 runtime_end = time.time()

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pycaret_experiment/supervised_experiment.py:1223, in _SupervisedExperiment._create_model_with_cv(self, model, data_X, data_y, fit_kwargs, round, cv, groups, metrics, refit, system, display, error_score, return_train_score)
   1221 self.logger.info("Finalizing model")
   1222 with redirect_output(self.logger):
-> 1223     pipeline_with_model.fit(data_X, data_y, **fit_kwargs)
   1224     model_fit_end = time.time()
   1226 # calculating metrics on predictions of complete train dataset

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pipeline.py:278, in Pipeline.fit(self, X, y, **params)
    276 if self._final_estimator != "passthrough":
    277     last_step_params = routed_params[self.steps[-1][0]]
--> 278     fitted_estimator = self._memory_fit(
    279         clone(self.steps[-1][1]), X, y, **last_step_params["fit"]
    280     )
    281     # Hacky way to make sure that the state of the estimator
    282     # loaded from cache is carried over to the estimator
    283     # in steps
    284     _copy_estimator_state(fitted_estimator, self.steps[-1][1])

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/memory.py:353, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    352 def __call__(self, *args, **kwargs):
--> 353     return self.func(*args, **kwargs)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/pycaret/internal/pipeline.py:69, in _fit_one(transformer, X, y, message, params)
     67         if "y" in signature(transformer.fit).parameters:
     68             args.append(y)
---> 69         transformer.fit(*args)
     70 return transformer

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1467     estimator._validate_params()
   1469 with config_context(
   1470     skip_parameter_validation=(
   1471         prefer_skip_nested_validation or global_skip_validation
   1472     )
   1473 ):
-> 1474     return fit_method(estimator, *args, **kwargs)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_voting.py:366, in VotingClassifier.fit(self, X, y, sample_weight)
    363 self.classes_ = self.le_.classes_
    364 transformed_y = self.le_.transform(y)
--> 366 return super().fit(X, transformed_y, sample_weight)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/ensemble/_voting.py:89, in _BaseVoting.fit(self, X, y, sample_weight)
     83 if self.weights is not None and len(self.weights) != len(self.estimators):
     84     raise ValueError(
     85         "Number of `estimators` and weights must be equal; got"
     86         f" {len(self.weights)} weights, {len(self.estimators)} estimators"
     87     )
---> 89 self.estimators_ = Parallel(n_jobs=self.n_jobs)(
     90     delayed(_fit_single_estimator)(
     91         clone(clf),
     92         X,
     93         y,
     94         sample_weight=sample_weight,
     95         message_clsname="Voting",
     96         message=self._log_message(names[idx], idx + 1, len(clfs)),
     97     )
     98     for idx, clf in enumerate(clfs)
     99     if clf != "drop"
    100 )
    102 self.named_estimators_ = Bunch()
    104 # Uses 'drop' as placeholder for dropped estimators

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/sklearn/utils/parallel.py:67, in Parallel.__call__(self, iterable)
     62 config = get_config()
     63 iterable_with_config = (
     64     (_with_config(delayed_func, config), args, kwargs)
     65     for delayed_func, args, kwargs in iterable
     66 )
---> 67 return super().__call__(iterable_with_config)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1952, in Parallel.__call__(self, iterable)
   1946 # The first item from the output is blank, but it makes the interpreter
   1947 # progress until it enters the Try/Except block of the generator and
   1948 # reach the first `yield` statement. This starts the aynchronous
   1949 # dispatch of the tasks to the workers.
   1950 next(output)
-> 1952 return output if self.return_generator else list(output)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1595, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1592     yield
   1594     with self._backend.retrieval_context():
-> 1595         yield from self._retrieve()
   1597 except GeneratorExit:
   1598     # The generator has been garbage collected before being fully
   1599     # consumed. This aborts the remaining tasks if possible and warn
   1600     # the user if necessary.
   1601     self._exception = True

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1699, in Parallel._retrieve(self)
   1692 while self._wait_retrieval():
   1693 
   1694     # If the callback thread of a worker has signaled that its task
   1695     # triggered an exception, or if the retrieval loop has raised an
   1696     # exception (e.g. `GeneratorExit`), exit the loop and surface the
   1697     # worker traceback.
   1698     if self._aborting:
-> 1699         self._raise_error_fast()
   1700         break
   1702     # If the next job is not ready for retrieval yet, we just wait for
   1703     # async callbacks to progress.

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:1734, in Parallel._raise_error_fast(self)
   1730 # If this error job exists, immediatly raise the error by
   1731 # calling get_result. This job might not exists if abort has been
   1732 # called directly or if the generator is gc'ed.
   1733 if error_job is not None:
-> 1734     error_job.get_result(self.timeout)

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:736, in BatchCompletionCallBack.get_result(self, timeout)
    730 backend = self.parallel._backend
    732 if backend.supports_retrieve_callback:
    733     # We assume that the result has already been retrieved by the
    734     # callback thread, and is stored internally. It's just waiting to
    735     # be returned.
--> 736     return self._return_or_raise()
    738 # For other backends, the main thread needs to run the retrieval step.
    739 try:

File ~/mambaforge/envs/project_1/lib/python3.11/site-packages/joblib/parallel.py:754, in BatchCompletionCallBack._return_or_raise(self)
    752 try:
    753     if self.status == TASK_ERROR:
--> 754         raise self._result
    755     return self._result
    756 finally:

ValueError: cannot set WRITEABLE flag to True of this array
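
The remote traceback above ends in scikit-learn's check_array at `array.flags.writeable = True`. NumPy raises this error when an array is backed by memory it is not allowed to write to, such as a read-only memory map. The sketch below reproduces the same message with plain NumPy, outside PyCaret and joblib. It is only an illustration of the mechanism, resting on the unconfirmed assumption that the worker process received a read-only memory-mapped input. The file name and the helper `writeable_error` are hypothetical.

```python
import os
import tempfile

import numpy as np


def writeable_error(array):
    """Try to mark `array` writeable; return the error message, or None on success."""
    try:
        array.flags.writeable = True
    except ValueError as exc:
        return str(exc)
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X.dat")  # hypothetical scratch file
    np.random.rand(10, 4).tofile(path)

    # Open the file as a read-only memory map (mode="r").
    readonly = np.memmap(path, dtype=np.float64, mode="r", shape=(10, 4))
    message = writeable_error(readonly)

    # An in-memory copy owns its data, so it is writeable.
    writable_copy = np.array(readonly)
    copy_is_writeable = writable_copy.flags.writeable

    del readonly  # release the mapping before the directory is removed

print(message)
print(copy_is_writeable)
```

If that assumption holds, copying the input into ordinary memory before the flag is set would avoid the error, though where such a copy belongs in PyCaret or scikit-learn is a question for the maintainers.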

Installed Versions

System:
    python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
    executable: /home/cmobley/mambaforge/envs/project_1/bin/python
    machine: Linux-6.5.0-35-generic-x86_64-with-glibc2.35

PyCaret required dependencies:
    pip: 24.0
    setuptools: 70.0.0
    pycaret: 3.3.2
    IPython: 8.24.0
    ipywidgets: 8.1.2
    tqdm: 4.66.4
    numpy: 1.26.4
    pandas: 2.1.4
    jinja2: 3.1.4
    scipy: 1.11.4
    joblib: 1.3.2
    sklearn: 1.4.2
    pyod: 1.1.3
    imblearn: 0.12.2
    category_encoders: 2.6.3
    lightgbm: 4.3.0
    numba: 0.59.1
    requests: 2.32.2
    matplotlib: 3.7.5
    scikitplot: 0.3.7
    yellowbrick: 1.5
    plotly: 5.22.0
    plotly-resampler: Not installed
    kaleido: 0.2.1
    schemdraw: 0.15
    statsmodels: 0.14.2
    sktime: 0.26.0
    tbats: 1.1.3
    pmdarima: 2.0.4
    psutil: 5.9.8
    markupsafe: 2.1.5
    pickle5: Not installed
    cloudpickle: 3.0.0
    deprecation: 2.1.0
    xxhash: 3.4.1
    wurlitzer: 3.1.0

PyCaret optional dependencies:
    shap: 0.44.1
    interpret: 0.6.1
    umap: 0.5.6
    ydata_profiling: 4.8.3
    explainerdashboard: 0.4.7
    autoviz: Not installed
    fairlearn: 0.7.0
    deepchecks: Not installed
    xgboost: 2.0.3
    catboost: 1.2.5
    kmodes: 0.12.2
    mlxtend: 0.23.1
    statsforecast: 1.5.0
    tune_sklearn: Not installed
    ray: Not installed
    hyperopt: 0.2.7
    optuna: 3.6.0
    skopt: 0.10.1
    mlflow: 2.13.0
    gradio: 4.31.5
    fastapi: 0.111.0
    uvicorn: 0.29.0
    m2cgen: 0.10.0
    evidently: 0.4.25
    fugue: 0.8.7
    streamlit: Not installed
    prophet: Not installed

celestinoxp commented 4 months ago

@Yard1 can you take a look?