ahuber21 opened 1 year ago
Hey @ahuber21, thanks for using LightGBM. The prediction takes the right leaf at the last split; you can see the criteria here: https://github.com/microsoft/LightGBM/blob/8ed371cee49cf86740b25dd9a4b985a75c9f2dba/python-package/lightgbm/plotting.py#L426-L440 In this case the feature value is NaN and the missing type is "None", so the value is set to 0 and then compared against the thresholds.
Please let us know if you have further doubts.
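To make that concrete, here is a minimal sketch of the split criterion described above, in plain Python (a hypothetical `route_value` helper, not the library's code), covering only the two missing types discussed here and a numerical `<=` split:

```python
import math

def route_value(value, threshold, missing_type, default_left):
    """Sketch of how a single numerical split routes a feature value."""
    if math.isnan(value):
        if missing_type == "None":
            # NaN is not treated as missing: it is replaced by 0.0
            # and compared against the threshold like any other value.
            value = 0.0
        else:
            # missing_type "NaN" (the other case discussed here):
            # NaN follows the default direction instead.
            return "left" if default_left else "right"
    return "left" if value <= threshold else "right"

# Last split of the dumped tree below: threshold -0.2550..., missing_type "None".
# NaN becomes 0.0, and 0.0 <= -0.2550 is False, so the record goes right.
print(route_value(float("nan"), -0.2550224746204132, "None", True))  # "right"
```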
Just to complement the answer a bit: the missing type is "None" because you didn't have any missing values in your training set. https://github.com/microsoft/LightGBM/blob/8ed371cee49cf86740b25dd9a4b985a75c9f2dba/src/io/bin.cpp#L322-L333
Hi @jmoralez, the docs suggested that `NaN` would be used:

> LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting `zero_as_missing=true`.

https://lightgbm.readthedocs.io/en/v4.1.0/Advanced-Topics.html
Moreover, I tried to use `X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10)` with the same result.
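For reference, NumPy coerces `None` to `NaN` when building a float array, so that input ends up identical to passing `np.nan` directly:

```python
import numpy as np

X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10)
print(np.isnan(X_none).all())  # True: every None became NaN
```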
At this point, though, I agree this is not a bug. I will modify my code / training sample such that `MissingType::NaN` will end up being selected.
Nevertheless, the behavior feels a bit inconsistent. Maybe the docs can be aligned a bit better with the code.
Thank you!
That refers to the training part. If you have `NaN`s in your training set they will be represented as missing and the missing type will be set to `MissingType::NaN` (C++ enum). If you don't have any missing values in your training set the missing type will be `MissingType::None`, unless you set `zero_as_missing=True`. For inference, both `None` and `NaN` (the Python values) should produce the same results.
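Roughly, something like this illustrates the cases (a sketch with made-up random data and default parameters, not your example; the `missing_type` strings are read back from `dump_model()`):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)

def root_missing_type(X, y, **extra_params):
    """Train a one-tree booster and report the missing_type of its root split."""
    params = {"objective": "regression", "verbose": -1, **extra_params}
    booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=1)
    return booster.dump_model()["tree_info"][0]["tree_structure"].get("missing_type")

print(root_missing_type(X, y))                        # "None": no missing values in training data
X_nan = X.copy()
X_nan[:5, :] = np.nan                                 # inject some NaN rows
print(root_missing_type(X_nan, y))                    # "NaN": NaNs seen during training
print(root_missing_type(X, y, zero_as_missing=True))  # "Zero": zeros treated as missing
```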
Thanks for the details. I was mostly surprised because the behavior was different from similar models, e.g. classifiers from XGBoost. After adding `NaN` values to my training data set everything works, like you explained. Great!
Please consider this issue resolved. But allow me one more question out of curiosity.
It looks like LightGBM is making a couple of assumptions about what is a zero and what is missing. Effectively `0`, `None`, and `NaN` could each be either the literal value zero or missing. That's just a lot of possibilities, and I doubt that users will notice when this goes wrong, as the model will still produce valid-looking results. I only discovered my issue/misunderstanding in a unit test. Do you think the average LightGBM user is aware of these intricacies?
(Also, how are these prioritized? What happens if there are `None`s and `NaN`s in the training set? What happens when there are neither, but both `None`s and `NaN`s are in the inference data, etc.?)
Hey. I agree that the rules can be confusing, #2921 was exactly about trying to clarify that. We also have #4040 to warn the user about this behavior, which might have helped you in this case.
About your questions: `None` and `NaN` are treated in the same way (they're converted to `NaN`).
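For completeness, a quick way to convince yourself of that at inference time (again a sketch with random data, not your model):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, y), num_boost_round=1)

X_nan = np.full((2, 10), np.nan)
X_none = np.array([[None] * 10] * 2, dtype=np.float64)

# Both inputs take the same path through the tree, so the predictions match.
assert np.array_equal(booster.predict(X_nan), booster.predict(X_none))
```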
Description
For a `lightgbm.basic.Booster` (regression) that was created using `lightgbm.train()`, the output of `model.predict()` does not correspond to the expectation from `dump_model()`.
Reproducible example
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM
Additional Comments
Edit: Also reproduced with v4.1.0 from PyPI
JSON dump of the tree
```json { "name": "tree", "version": "v4", "num_class": 1, "num_tree_per_iteration": 1, "label_index": 0, "max_feature_idx": 9, "objective": "regression", "average_output": false, "feature_names": [ "Column_0", "Column_1", "Column_2", "Column_3", "Column_4", "Column_5", "Column_6", "Column_7", "Column_8", "Column_9" ], "monotone_constraints": [], "feature_infos": { "Column_0": { "min_value": -2.211135309007885, "max_value": 2.632382064837391, "values": [] }, "Column_1": { "min_value": -2.650969808393012, "max_value": 3.0788808084552377, "values": [] }, "Column_2": { "min_value": -2.6197451040897444, "max_value": 2.5733598032498604, "values": [] }, "Column_3": { "min_value": -3.2412673400690726, "max_value": 2.5600845382687947, "values": [] }, "Column_4": { "min_value": -1.9875689146008928, "max_value": 3.852731490654721, "values": [] }, "Column_5": { "min_value": -2.301921164735585, "max_value": 2.075400798645439, "values": [] }, "Column_6": { "min_value": -2.198805956620082, "max_value": 2.463242112485286, "values": [] }, "Column_7": { "min_value": -1.9187712152990417, "max_value": 2.5269324258736217, "values": [] }, "Column_8": { "min_value": -2.6968866429415717, "max_value": 1.8861859012105302, "values": [] }, "Column_9": { "min_value": -2.4238793266289567, "max_value": 2.4553001399108942, "values": [] } }, "tree_info": [ { "tree_index": 0, "num_leaves": 4, "num_cat": 0, "shrinkage": 1, "tree_structure": { "split_index": 0, "split_feature": 6, "split_gain": 1223880, "threshold": 0.8576786455955732, "decision_type": "<=", "default_left": true, "missing_type": "None", "internal_value": 10.9999, "internal_weight": 0, "internal_count": 100, "left_child": { "split_index": 1, "split_feature": 0, "split_gain": 475907, "threshold": 0.4374422005998846, "decision_type": "<=", "default_left": true, "missing_type": "None", "internal_value": 4.61272, "internal_weight": 75, "internal_count": 75, "left_child": { "split_index": 2, "split_feature": 1, "split_gain": 161949, "threshold": -0.2550224746204132, "decision_type": "<=", "default_left": true, "missing_type": "None", "internal_value": -1.01996, "internal_weight": 50, "internal_count": 50, "left_child": { "leaf_index": 0, "leaf_value": -7.990239036381247, "leaf_weight": 20, "leaf_count": 20 }, "right_child": { "leaf_index": 3, "leaf_value": 3.6268868912259737, "leaf_weight": 30, "leaf_count": 30 } }, "right_child": { "leaf_index": 2, "leaf_value": 15.878088150918485, "leaf_weight": 25, "leaf_count": 25 } }, "right_child": { "leaf_index": 1, "leaf_value": 30.16146995943785, "leaf_weight": 25, "leaf_count": 25 } } } ], "feature_importances": { "Column_0": 1, "Column_1": 1, "Column_6": 1 }, "pandas_categorical": null } ```