ahuber21 opened 1 year ago
Hey @ahuber21, thanks for using LightGBM. The prediction takes the right leaf at the last split; you can see the criteria here: https://github.com/microsoft/LightGBM/blob/8ed371cee49cf86740b25dd9a4b985a75c9f2dba/python-package/lightgbm/plotting.py#L426-L440 In this case the feature value is NaN and the missing type is "None", so the value is set to 0 and then compared against the thresholds.
Please let us know if you have further doubts.
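To make that concrete, here is a minimal sketch of the split criterion described above, in plain Python (a hypothetical `route_value` helper, not the library's code), covering only the two missing types discussed here and a numerical `<=` split:

```python
import math

def route_value(value, threshold, missing_type, default_left):
    """Sketch of how a single numerical split routes a feature value."""
    if math.isnan(value):
        if missing_type == "None":
            # NaN is not treated as missing: it is replaced by 0.0
            # and compared against the threshold like any other value.
            value = 0.0
        else:
            # missing_type "NaN" (the other case discussed here):
            # NaN follows the default direction instead.
            return "left" if default_left else "right"
    return "left" if value <= threshold else "right"

# Last split of the dumped tree below: threshold -0.2550..., missing_type "None".
# NaN becomes 0.0, and 0.0 <= -0.2550 is False, so the record goes right.
print(route_value(float("nan"), -0.2550224746204132, "None", True))  # "right"
```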
Just to complement the answer a bit: the missing type is "None" because you didn't have any missing values in your training set. https://github.com/microsoft/LightGBM/blob/8ed371cee49cf86740b25dd9a4b985a75c9f2dba/src/io/bin.cpp#L322-L333
Hi @jmoralez, the docs suggested that `NaN` would be used:

> LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting `zero_as_missing=true`.

https://lightgbm.readthedocs.io/en/v4.1.0/Advanced-Topics.html
Moreover, I tried to use `X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10)` with the same result.
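For reference, NumPy coerces `None` to `NaN` when building a float array, so that input ends up identical to passing `np.nan` directly:

```python
import numpy as np

X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10)
print(np.isnan(X_none).all())  # True: every None became NaN
```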
At this point, though, I agree this is not a bug. I will modify my code / training sample such that `MissingType::NaN` will end up being selected.
Nevertheless, the behavior feels a bit inconsistent. Maybe the docs can be aligned a bit better with the code.
Thank you!
That refers to the training part. If you have `NaN`s in your training set they will be represented as missing and the missing type will be set to `MissingType::NaN` (C++ enum). If you don't have any missing values in your training set the missing type will be `MissingType::None`, unless you set `zero_as_missing=True`. For inference, both `None` and `NaN` (the Python values) should produce the same results.
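Roughly, something like this illustrates the cases (a sketch with made-up random data and default parameters, not your example; the `missing_type` strings are read back from `dump_model()`):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)

def root_missing_type(X, y, **extra_params):
    """Train a one-tree booster and report the missing_type of its root split."""
    params = {"objective": "regression", "verbose": -1, **extra_params}
    booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=1)
    return booster.dump_model()["tree_info"][0]["tree_structure"].get("missing_type")

print(root_missing_type(X, y))                        # "None": no missing values in training data
X_nan = X.copy()
X_nan[:5, :] = np.nan                                 # inject some NaN rows
print(root_missing_type(X_nan, y))                    # "NaN": NaNs seen during training
print(root_missing_type(X, y, zero_as_missing=True))  # "Zero": zeros treated as missing
```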
Thanks for the details. I was mostly surprised because the behavior was different from similar models, e.g. classifiers from XGBoost. After adding `NaN` values to my training data set everything works, like you explained. Great!
Please consider this issue resolved. But allow me one more question out of curiosity.
It looks like LightGBM is making a couple of assumptions about what is a zero and what is missing. Effectively `0`, `None`, and `NaN` could each be either the literal value zero or missing. That's just a lot of possibilities, and I doubt that users will notice when this goes wrong, as the model will still produce valid-looking results. I only discovered my issue/misunderstanding in a unit test. Do you think the average LightGBM user is aware of these intricacies?
(Also, how are these prioritized? What happens if there are `None`s and `NaN`s in the training set? What happens when there are neither, but both `None`s and `NaN`s are in the inference data, etc.?)
Hey. I agree that the rules can be confusing, #2921 was exactly about trying to clarify that. We also have #4040 to warn the user about this behavior, which might have helped you in this case.
About your questions: `None` and `NaN` are treated in the same way (they're converted to `NaN`).
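For completeness, a quick way to convince yourself of that at inference time (again a sketch with random data, not your model):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, y), num_boost_round=1)

X_nan = np.full((2, 10), np.nan)
X_none = np.array([[None] * 10] * 2, dtype=np.float64)

# Both inputs take the same path through the tree, so the predictions match.
assert np.array_equal(booster.predict(X_nan), booster.predict(X_none))
```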
Description
For a `lightgbm.basic.Booster` (regression) that was created using `lightgbm.train()`, the output of `model.predict()` does not correspond to the expectation from `dump_model()`.
Reproducible example
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM
Additional Comments
Edit: Also reproduced with v4.1.0 from PyPI
JSON dump of the tree
```json { "name": "tree", "version": "v4", "num_class": 1, "num_tree_per_iteration": 1, "label_index": 0, "max_feature_idx": 9, "objective": "regression", "average_output": false, "feature_names": [ "Column_0", "Column_1", "Column_2", "Column_3", "Column_4", "Column_5", "Column_6", "Column_7", "Column_8", "Column_9" ], "monotone_constraints": [], "feature_infos": { "Column_0": { "min_value": -2.211135309007885, "max_value": 2.632382064837391, "values": [] }, "Column_1": { "min_value": -2.650969808393012, "max_value": 3.0788808084552377, "values": [] }, "Column_2": { "min_value": -2.6197451040897444, "max_value": 2.5733598032498604, "values": [] }, "Column_3": { "min_value": -3.2412673400690726, "max_value": 2.5600845382687947, "values": [] }, "Column_4": { "min_value": -1.9875689146008928, "max_value": 3.852731490654721, "values": [] }, "Column_5": { "min_value": -2.301921164735585, "max_value": 2.075400798645439, "values": [] }, "Column_6": { "min_value": -2.198805956620082, "max_value": 2.463242112485286, "values": [] }, "Column_7": { "min_value": -1.9187712152990417, "max_value": 2.5269324258736217, "values": [] }, "Column_8": { "min_value": -2.6968866429415717, "max_value": 1.8861859012105302, "values": [] }, "Column_9": { "min_value": -2.4238793266289567, "max_value": 2.4553001399108942, "values": [] } }, "tree_info": [ { "tree_index": 0, "num_leaves": 4, "num_cat": 0, "shrinkage": 1, "tree_structure": { "split_index": 0, "split_feature": 6, "split_gain": 1223880, "threshold": 0.8576786455955732, "decision_type": "<=", "default_left": true, "missing_type": "None", "internal_value": 10.9999, "internal_weight": 0, "internal_count": 100, "left_child": { "split_index": 1, "split_feature": 0, "split_gain": 475907, "threshold": 0.4374422005998846, "decision_type": "<=", "default_left": true, "missing_type": "None", "internal_value": 4.61272, "internal_weight": 75, "internal_count": 75, "left_child": { "split_index": 2, "split_feature": 1, "split_gain": 161949, "threshold": -0.2550224746204132, "decision_type": "<=", "default_left": true, "missing_type": "None", "internal_value": -1.01996, "internal_weight": 50, "internal_count": 50, "left_child": { "leaf_index": 0, "leaf_value": -7.990239036381247, "leaf_weight": 20, "leaf_count": 20 }, "right_child": { "leaf_index": 3, "leaf_value": 3.6268868912259737, "leaf_weight": 30, "leaf_count": 30 } }, "right_child": { "leaf_index": 2, "leaf_value": 15.878088150918485, "leaf_weight": 25, "leaf_count": 25 } }, "right_child": { "leaf_index": 1, "leaf_value": 30.16146995943785, "leaf_weight": 25, "leaf_count": 25 } } } ], "feature_importances": { "Column_0": 1, "Column_1": 1, "Column_6": 1 }, "pandas_categorical": null } ```