Closed by connortann 2 months ago
Thanks for taking the time to report this, and for the excellent minimal, reproducible example!
This is very similar to #3636, but it's totally fine to keep this open, as it includes some details that that report doesn't.
Using lightgbm==4.3.0 and numpy==2.0.0, I was able to reproduce this tonight. Adding the text of the ValueError here, so others experiencing this can find it from search engines.
ValueError: y contains previously unseen labels: [20]
I think @RektPunk's PR (#6524) will resolve this, so I recommend subscribing for notifications there to see when that makes it into a release.
I recognise that using the multi-class objective is unusual in the context of binary classification.
It is. And in the case of LightGBM, it'll roughly double the training time and the model's physical size (in memory and on disk), compared to using a binary classification objective, because LightGBM will train 2 trees per boosting round.
But still... the experience here should be better than that hard-to-understand ValueError, so we appreciate the report!
It seems that the reason for the ValueError is a shape issue. The value of self.n_classes_ is 2, and suppose that
>>> result
array([[0.45, 0.55],
[0.45, 0.55],
...
[0.45, 0.55]])
is returned. Depending on the condition, the following value is returned:
>>> result_vstack = np.vstack((1. - result, result)).transpose()
>>> result_vstack
array([[0.55, 0.55, 0.55, ..., 0.45, 0.45, 0.45],
       [0.45, 0.45, 0.45, ..., 0.55, 0.55, 0.55]])
>>> np.argmax(result_vstack, axis=1)
array([0, 20])
This value is passed to self._le.inverse_transform(class_index), causing the ValueError.
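The shape mismatch is easy to demonstrate with plain NumPy, independent of LightGBM. This is a standalone sketch of the mechanism described above; the 20-row array and variable names are illustrative, not copied from LightGBM's source:

```python
import numpy as np

n_samples = 20

# Probabilities for a binary task trained with the multiclass objective:
# already shaped (n_samples, 2), one column per class.
result = np.tile([0.45, 0.55], (n_samples, 1))

# The binary-classification code path assumes `result` is 1-D (positive-class
# probabilities only) and rebuilds the two-column matrix like this:
result_vstack = np.vstack((1. - result, result)).transpose()

# Because `result` was already 2-D, the stacked array ends up (2, 40)
# instead of the intended (20, 2).
print(result_vstack.shape)

# argmax along axis=1 then yields row positions up to 2 * n_samples - 1,
# e.g. 20, which is not a valid class index for the label encoder.
print(np.argmax(result_vstack, axis=1))
```

Running this prints `(2, 40)` and `[ 0 20]`: the 20 is the out-of-range "class index" that `inverse_transform` rejects with "previously unseen labels: [20]".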
Description
The scikit-learn interface throws an error when used with objective="multiclass" for a binary classification task.
Reproducible example
Traceback:
Environment info
LightGBM version or commit hash: 4.3.0
Command(s) you used to install LightGBM: pip install lightgbm
OS: Linux
Additional Comments
I recognise that using the multi-class objective is unusual in the context of binary classification.
The usual LightGBM interface works as expected:
As a bit of context: I am investigating some issues in the shap project which relate to how the package handles multi-class predictions. There is some inconsistency between the various ML libraries in the shape of returned predictions for binary classification: num_samples x num_classes, or just num_samples.
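To illustrate that inconsistency: some libraries return a 1-D vector of positive-class probabilities for binary problems, while others return the full two-column matrix. A small NumPy sketch of normalizing both conventions to the (num_samples, 2) form (the helper name here is my own, not an API from shap or LightGBM):

```python
import numpy as np

def as_two_column(proba):
    """Normalize binary-classification probabilities to shape (n_samples, 2).

    Accepts either a 1-D array of positive-class probabilities or an
    already two-column (n_samples, 2) array.
    """
    proba = np.asarray(proba, dtype=float)
    if proba.ndim == 1:
        # Only p(class=1) was given: derive p(class=0) as the complement
        # and pair the columns sample-by-sample.
        return np.column_stack((1.0 - proba, proba))
    return proba

# 1-D convention (e.g. just the positive-class probability per sample):
print(as_two_column([0.7, 0.2]))      # [[0.3, 0.7], [0.8, 0.2]]

# 2-D convention is passed through unchanged:
print(as_two_column([[0.3, 0.7], [0.8, 0.2]]).shape)  # (2, 2)
```

Note that `np.column_stack` pairs the complement with the original per sample, which avoids the shape blow-up that `np.vstack(...).transpose()` causes when the input is already 2-D.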