microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python-package] ValueError when using scikit-learn API for multiclass binary classification #6519

Closed connortann closed 2 months ago

connortann commented 3 months ago

Description

The scikit-learn interface throws an error when used with objective="multiclass" for a binary classification task.

Reproducible example

import numpy as np
import lightgbm as lgb

num_examples, num_features = 20, 3
X = np.random.uniform(size=[num_examples, num_features])
y = np.random.choice([0, 1], size=num_examples)
model = lgb.LGBMClassifier(objective="multiclass", num_classes=2).fit(X, y)
model.predict(X)  # Raises ValueError

Traceback:

```python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 model.predict(X)

File ~/miniforge3/envs/shap/lib/python3.11/site-packages/lightgbm/sklearn.py:1237, in LGBMClassifier.predict(self, X, raw_score, start_iteration, num_iteration, pred_leaf, pred_contrib, validate_features, **kwargs)
   1235 else:
   1236     class_index = np.argmax(result, axis=1)
-> 1237     return self._le.inverse_transform(class_index)

File ~/miniforge3/envs/shap/lib/python3.11/site-packages/sklearn/preprocessing/_label.py:160, in LabelEncoder.inverse_transform(self, y)
    158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
    159 if len(diff):
--> 160     raise ValueError("y contains previously unseen labels: %s" % str(diff))
    161 y = np.asarray(y)
    162 return self.classes_[y]

ValueError: y contains previously unseen labels: [20]
```

Environment info

LightGBM version or commit hash: 4.3.0

Command(s) you used to install LightGBM: pip install lightgbm

OS: Linux

Additional Comments

I recognise that using the multi-class objective is unusual in the context of binary classification.

The native LightGBM training interface (lgb.train) works as expected:

model = lgb.train(dict(objective="multiclass", num_classes=2), lgb.Dataset(X, label=y))
model.predict(X)  # Succeeds

As a bit of context, I am investigating some issues in the shap project that relate to how the package handles multi-class predictions. The various ML libraries are inconsistent in the shape of the predictions they return for binary classification: some return num_samples x num_classes, others just num_samples.
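To make the shape inconsistency concrete, here is a minimal sketch of how a consumer could normalize both layouts to num_samples x 2. The helper name `as_two_column` is hypothetical and is not part of shap or LightGBM; it assumes a 1-D input holds positive-class probabilities.

```python
import numpy as np


def as_two_column(proba):
    """Return binary-class probabilities with shape (num_samples, 2).

    Hypothetical helper for illustration only. A 1-D input is assumed to
    hold the probability of the positive class for each sample.
    """
    proba = np.asarray(proba, dtype=float)
    if proba.ndim == 1:
        # Layout "num_samples": add the complementary negative-class column.
        return np.column_stack((1.0 - proba, proba))
    if proba.ndim == 2 and proba.shape[1] == 2:
        # Layout "num_samples x num_classes": already in the target shape.
        return proba
    raise ValueError(f"unexpected prediction shape: {proba.shape}")


one_d = np.array([0.2, 0.7, 0.9])   # made-up positive-class probabilities
two_d = as_two_column(one_d)        # shape (3, 2)
same = as_two_column(two_d)         # shape stays (3, 2)
```

Checking `ndim` explicitly, rather than inferring the layout from the number of classes, is what keeps a two-column input from being expanded a second time.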

jameslamb commented 3 months ago

Thanks for taking the time to report this, and for the excellent minimal, reproducible example!

This is very similar to #3636. But totally fine to keep it open, as it has some other details that that report doesn't have.

Using lightgbm==4.3.0 and numpy==2.0.0, I was able to reproduce this tonight. Adding the text of the ValueError here, so others experiencing this can find it from search engines.

ValueError: y contains previously unseen labels: [20]

I think @RektPunk's PR (#6524) will resolve this, so I recommend subscribing for notifications there to see when that makes it into a release.

> I recognise that using the multi-class objective is unusual in the context of binary classification.

It is. And in the case of LightGBM, it'll roughly double the training time and the model's physical size (in memory and on disk), compared to using a binary classification objective, because LightGBM will train 2 trees per boosting round.

But still... the experience here should be better than that hard-to-understand ValueError, so we appreciate the report!
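As a rough illustration of why the second tree per round adds no modelling power, the sketch below checks that a 2-class softmax depends only on the difference between the two raw scores, which is what a single binary (sigmoid) score expresses. It assumes the standard softmax and sigmoid links with no extra scaling, and the raw scores are made up.

```python
import math


def softmax2(a, b):
    """Softmax over two raw scores, shifted by the max for numerical stability."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up raw scores for class 0 and class 1 (one per tree in a round).
raw_class0, raw_class1 = -0.3, 0.5

p0, p1 = softmax2(raw_class0, raw_class1)

# The class-1 probability equals the sigmoid of the score difference,
# so one score per round would carry the same information as two.
binary_equivalent = sigmoid(raw_class1 - raw_class0)
print(math.isclose(p1, binary_equivalent))
```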

RektPunk commented 3 months ago

The ValueError appears to be caused by a shape mismatch. The value of self.n_classes_ is 2, and suppose that

>>> result
array([[0.45, 0.55],
       [0.45, 0.55],
        ...
       [0.45, 0.55]])

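is returned. Because n_classes_ is 2, the binary-classification branch appears to be taken. It stacks 1 - result and result as if result were a 1-D array of positive-class probabilities, so the returned value has shape (2, 40) instead of (20, 2):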

>>> result_vstack = np.vstack((1. - result, result)).transpose()
>>> result_vstack
array([[0.55, 0.55, 0.55, ..., 0.45, 0.45, 0.45],
       [0.45, 0.45, 0.45, ..., 0.55, 0.55, 0.55]])
>>> np.argmax(result_vstack, axis=1)
array([0, 20])

This value is passed as class_index to self._le.inverse_transform(class_index). Since 20 is not a valid index into the 2 encoded classes, the call raises the ValueError.
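The steps above can be reproduced with NumPy alone. This is a sketch of the shape problem under the same assumed values (20 samples, constant probabilities 0.45 and 0.55), not the actual LightGBM source. The unseen-label check mirrors the `np.setdiff1d` line shown in the traceback.

```python
import numpy as np

num_samples, num_classes = 20, 2

# Assumed 2-D multiclass output: one row per sample, one column per class.
result = np.tile([0.45, 0.55], (num_samples, 1))          # shape (20, 2)

# Stacking as if `result` were 1-D positive-class probabilities.
result_vstack = np.vstack((1.0 - result, result)).transpose()  # shape (2, 40)

# argmax over axis 1 now scans 40 columns, so it can return indices up to 39.
bad_class_index = np.argmax(result_vstack, axis=1)        # array([ 0, 20])

# Same check LabelEncoder.inverse_transform performs in the traceback.
unseen = np.setdiff1d(bad_class_index, np.arange(num_classes))  # array([20])

# Taking argmax on the untouched 2-D result gives one valid label per sample.
good_class_index = np.argmax(result, axis=1)              # shape (20,)
```

With the 2-D result left as-is, every entry of `good_class_index` is a valid class index, so an `inverse_transform` call on it would have nothing to reject.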