mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Error for CatBoost: features data: pandas.DataFrame column 'store_type_pdist' has dtype 'category' but is not in cat_features list #383

Closed xiaobo closed 3 years ago

xiaobo commented 3 years ago

I have a dataset X_train (a pandas.DataFrame) in which the column 'store_type_pdist' has dtype 'category' and holds numerical values such as 0, 1, 2.

When I run code like:

from supervised.automl import AutoML

automl = AutoML(mode="Perform")
automl.fit(X_train, y_train)

I get the following error. Please help me resolve it, thanks.

## Error for 3_Default_CatBoost

features data: pandas.DataFrame column 'store_type_pdist' has dtype 'category' but is not in  cat_features list
Traceback (most recent call last):
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/supervised/base_automl.py", line 1074, in _fit
    trained = self.train_model(params)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/supervised/base_automl.py", line 363, in train_model
    self.keep_model(mf, model_subpath)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/supervised/base_automl.py", line 262, in keep_model
    self._base_predict(self._one_sample, model)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/supervised/base_automl.py", line 1265, in _base_predict
    predictions = model.predict(X)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/supervised/model_framework.py", line 387, in predict
    y_p = learner.predict(X_data)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/supervised/algorithms/catboost.py", line 275, in predict
    return self.model.predict(X, ntree_end=self.best_ntree_limit)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/catboost/core.py", line 4894, in predict
    return self._predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, 'predict')
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/catboost/core.py", line 1978, in _predict
    data, data_is_single_object = self._process_predict_input_data(data, parent_method_name, thread_count)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/catboost/core.py", line 1958, in _process_predict_input_data
    data = Pool(
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/catboost/core.py", line 455, in __init__
    self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
  File "/home/user/workspace/aitiaexplorer/uplift/automl/lib/python3.8/site-packages/catboost/core.py", line 966, in _init
    self._init_pool(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
  File "_catboost.pyx", line 3550, in _catboost._PoolBase._init_pool
  File "_catboost.pyx", line 3597, in _catboost._PoolBase._init_pool
  File "_catboost.pyx", line 3438, in _catboost._PoolBase._init_features_order_layout_pool
  File "_catboost.pyx", line 2433, in _catboost._set_features_order_data_pd_data_frame
_catboost.CatBoostError: features data: pandas.DataFrame column 'store_type_pdist' has dtype 'category' but is not in  cat_features list

Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new

Software versions:

Package Version
alembic 1.5.8
attrs 20.3.0
backcall 0.2.0
catboost 0.24.4
category-encoders 2.2.2
cliff 3.7.0
cloudpickle 1.3.0
cmaes 0.8.2
cmd2 1.5.0
colorama 0.4.4
colorlog 5.0.1
colour 0.1.5
cycler 0.10.0
decorator 5.0.7
dill 0.3.3
dtreeviz 1.0
graphviz 0.16
greenlet 1.0.0
iniconfig 1.1.1
ipykernel 5.5.3
ipython 7.22.0
ipython-genutils 0.2.0
jedi 0.18.0
joblib 1.0.1
jupyter-client 6.2.0
jupyter-core 4.7.1
kiwisolver 1.3.1
lightgbm 3.0.0
llvmlite 0.36.0
Mako 1.1.4
MarkupSafe 1.1.1
matplotlib 3.4.1
mljar-supervised 0.10.3
nest-asyncio 1.5.1
numba 0.53.1
numpy 1.19.5
optuna 2.6.0
packaging 20.9
pandas 1.2.0
parso 0.8.2
patsy 0.5.1
pbr 5.5.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.0.1
plotly 4.14.3
pluggy 0.13.1
prettytable 2.1.0
prompt-toolkit 3.0.18
ptyprocess 0.7.0
py 1.10.0
pyarrow 3.0.0
pyfunctional 1.4.3
Pygments 2.8.1
pyparsing 2.4.7
pyperclip 1.8.2
pytest 6.2.3
python-dateutil 2.8.1
python-editor 1.0.4
pytz 2021.1
PyYAML 5.4.1
pyzmq 22.0.3
retrying 1.3.3
scikit-learn 0.24.1
scipy 1.6.1
seaborn 0.10.1
setuptools 47.1.0
shap 0.36.0
six 1.15.0
slicer 0.0.7
SQLAlchemy 1.4.11
statsmodels 0.12.2
stevedore 3.3.0
tabulate 0.8.7
threadpoolctl 2.1.0
toml 0.10.2
tornado 6.1
tqdm 4.60.0
traitlets 5.0.5
wcwidth 0.2.5
wordcloud 1.8.1
xgboost 1.3.3

pplonski commented 3 years ago

@xiaobo thank you for reporting the issue. It looks like a bug.

Could you please try casting all your category columns to object type and then train again?

Here is how to cast to object type:

X_train['store_type_pdist'] = X_train['store_type_pdist'].astype(str)
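
To apply the same cast to every category column at once rather than one by one, something like the following may help. It is only a sketch: the helper name cast_category_columns_to_str and the toy DataFrame are made up for illustration and are not part of mljar-supervised.

```python
import pandas as pd


def cast_category_columns_to_str(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every 'category' column cast to strings.

    Hypothetical helper for illustration, not part of mljar-supervised.
    """
    out = df.copy()
    # select_dtypes picks out only the columns whose dtype is 'category'
    cat_cols = out.select_dtypes(include="category").columns
    for col in cat_cols:
        out[col] = out[col].astype(str)
    return out


# Toy frame standing in for the real X_train from the report
X_train = pd.DataFrame(
    {
        "store_type_pdist": pd.Categorical([0, 1, 2, 1]),
        "sales": [10.0, 12.5, 9.0, 11.0],
    }
)

X_fixed = cast_category_columns_to_str(X_train)
print(X_fixed.dtypes)
```

The original X_train is left untouched, so the result can be compared against it before calling automl.fit(X_fixed, y_train). Note that missing values in a category column become the literal string "nan" after the cast.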

xiaobo commented 3 years ago

Thanks for your help. I have worked around it.

pplonski commented 3 years ago

I couldn't reproduce the bug exactly as described in the issue. I've added direct recognition of category columns as categorical features in AutoML. It should now work without the workaround. The changes are in the dev branch.
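
For readers curious what recognizing category columns can look like, here is a rough sketch. It is not the actual code from the dev branch: the helper find_category_columns and the toy data are assumptions made for illustration. The idea is simply to collect the names of all columns with dtype 'category' so they can be listed as categorical features.

```python
import pandas as pd


def find_category_columns(df: pd.DataFrame) -> list:
    """Names of all columns with dtype 'category'.

    Hypothetical helper for illustration, not the mljar-supervised code.
    """
    return list(df.select_dtypes(include="category").columns)


# Toy frame standing in for the real X_train from the report
X_train = pd.DataFrame(
    {
        "store_type_pdist": pd.Categorical([0, 1, 2, 1]),
        "sales": [10.0, 12.5, 9.0, 11.0],
    }
)

cat_features = find_category_columns(X_train)
print(cat_features)

# The resulting names are what CatBoost expects in its cat_features
# argument, e.g. catboost.Pool(X_train, cat_features=cat_features).
```

The error in this issue is raised when a column has dtype 'category' but its name is missing from that list, so building the list from the dtypes keeps the two in sync.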

pplonski commented 3 years ago

@xiaobo I reproduced and fixed this issue. The changes are in the dev branch.