microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Column order impacting the predictions of LGBM (regression) #6671

Open erykml opened 1 week ago

erykml commented 1 week ago

Hi 🙂

I encountered some unexpected behavior and wanted to understand the reasoning behind it. The issue is regarding the impact of column order on model predictions in a regression setup. I’ve seen similar questions on this topic and tried applying various suggestions to achieve deterministic results, but without success.

Below is a toy example with two reorderings of the feature columns (feature sets 1 and 2) and two sets of hyperparameters (params 1 and 2).

With the default hyperparameters (params 1), I get the same results regardless of column order. However, with the second set (params 2), the results still match for feature set 1 but differ for feature set 2: only one observation in the test set gets a different prediction.

Could you please help me understand where the difference is coming from? In my actual use case, the discrepancies are larger than in this toy dataset.

If you need any further details regarding the environment, please let me know :)

Env:

macOS Sonoma 14.6.1
LGBM 4.5.0
sklearn 1.5.1

Toy example:

import pandas as pd

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# feature set #1
# features_set = ['HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedInc']

# feature set #2
features_set = ["Longitude", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "MedInc",]

# params 1
# params = {
#     "verbose": -1,
#     "seed": 42,
# }

# params 2
params = {
    "boosting_type": "gbdt",
    "max_depth": 4,
    "bagging_fraction": 1.0,
    "bagging_freq": 0,
    "feature_fraction": 1.0,
    "learning_rate": 0.019324,
    "num_leaves": 128,
    "min_data_in_leaf": 16,
    "max_bin": 90,
    "num_iterations": 267,
    "min_gain_to_split": 0.0,
    "lambda_l1": 0.001356,
    "lambda_l2": 0.000581,
    "verbose": -1,
    "seed": 42,
    "num_thread": 1,
    "deterministic": True,
    "force_row_wise": True,
}

train_data_1 = lgb.Dataset(X_train, label=y_train)
model_1 = lgb.train(params, train_data_1)
y_pred_1 = model_1.predict(X_test)
mse_1 = mean_squared_error(y_test, y_pred_1)

train_data_2 = lgb.Dataset(X_train[features_set], label=y_train)
model_2 = lgb.train(params, train_data_2)
y_pred_2 = model_2.predict(X_test[features_set])
mse_2 = mean_squared_error(y_test, y_pred_2)

print(mse_1 == mse_2)
jameslamb commented 5 days ago

Thanks for using LightGBM, and for taking the time to put together an excellent reproducible example!

Short Answer

During tree-building, LightGBM evaluates multiple candidate "splits", i.e. (feature, threshold) pairs. For each candidate, it computes a "gain": roughly, the improvement in the in-sample fit that results from splitting the data on that feature and threshold.

If multiple splits tie for the "best" gain, LightGBM will just choose the "first" one, which will generally mean a split on a feature appearing "earlier" (lower column index, or "further left") in the training data.
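As a rough illustration only (this is not LightGBM's actual split-finding code, and the gain values are made up), scanning candidates in column order with a strict greater-than comparison is enough for the earlier column to win any exact tie:

```python
# Toy illustration of the tie-break, NOT LightGBM internals; gains are made-up values.
# Candidate splits as (column_index, feature_name, threshold, gain); the two gains tie exactly.
candidate_splits = [
    (6, "Latitude", 37.9, 0.8123),
    (7, "Longitude", -122.3, 0.8123),
]

best = None
for col_idx, name, threshold, gain in candidate_splits:  # scanned in column order
    if best is None or gain > best[3]:  # strictly greater, so an exact tie keeps the earlier column
        best = (col_idx, name, threshold, gain)

print(best)  # (6, 'Latitude', 37.9, 0.8123): the lower-indexed column wins the tie
```

Reorder the two candidates (i.e. move Longitude to a lower column index) and the other feature wins, even though the gains are identical.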

Longer Answer

I've narrowed it down to a smaller example that reproduces the behavior, to help us focus on the root cause:

check.py (click me)

```python
import pandas as pd

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

features1 = ["HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude", "MedInc"]
features2 = ["Longitude", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "MedInc"]

params = {
    "max_depth": 4,
    "learning_rate": 0.019324,
    "min_data_in_leaf": 16,
    "num_iterations": 200,
    "verbose": -1,
    "seed": 42,
    "num_thread": 1,
    "deterministic": True,
    "force_row_wise": True,
}

train_data_1 = lgb.Dataset(X_train[features1], label=y_train)
model_1 = lgb.train(params, train_data_1)
y_pred_1 = model_1.predict(X_test[features1])
mse_1 = mean_squared_error(y_test, y_pred_1)
model_1.save_model("model_1.txt")

train_data_2 = lgb.Dataset(X_train[features2], label=y_train)
model_2 = lgb.train(params, train_data_2)
model_2.save_model("model_2.txt")
y_pred_2 = model_2.predict(X_test[features2])
mse_2 = mean_squared_error(y_test, y_pred_2)

assert mse_1 == mse_2, f"mse_1 ({mse_1}) != mse_2 ({mse_2})"
```

In that code snippet, notice that I've also saved each model out in text format. I compared those files in a diff tool and saw the following in the summary near the end:

[screenshots: the per-feature split counts reported near the end of model_1.txt and model_2.txt]
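For anyone following along without a GUI diff tool, something like the following gives a similar view (a sketch, assuming the model_1.txt / model_2.txt files written by the snippet above; the full diff is long because most of the tree structures differ too):

```python
import difflib

# Line-by-line diff of the two saved model files; only the first chunk is printed,
# since the trees themselves differ in many places.
with open("model_1.txt") as f_1, open("model_2.txt") as f_2:
    diff = difflib.unified_diff(
        f_1.readlines(),
        f_2.readlines(),
        fromfile="model_1.txt",
        tofile="model_2.txt",
    )

print("".join(list(diff)[:100]))
```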

The default "importance" reported there is "number of splits the feature is chosen for".

https://github.com/microsoft/LightGBM/blob/668bf5dadf1eb9a846302b2b76a313fbbef52870/python-package/lightgbm/basic.py#L4457-L4459

Notice that in the model where Longitude appears earlier in the feature list, Longitude is chosen for 6 more splits, and in the model where Latitude appears earlier, Latitude is chosen for 6 more splits.
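The same counts can also be pulled directly from the boosters rather than from the saved files; here is a small sketch, assuming model_1 and model_2 from the snippet above are still in scope:

```python
import pandas as pd

# Split-count importances for both models, aligned by feature name
# (the models contain the same features, just in a different column order).
imp_1 = pd.Series(
    model_1.feature_importance(importance_type="split"),
    index=model_1.feature_name(),
    name="splits_model_1",
)
imp_2 = pd.Series(
    model_2.feature_importance(importance_type="split"),
    index=model_2.feature_name(),
    name="splits_model_2",
)

comparison = pd.concat([imp_1, imp_2], axis=1)
comparison["diff"] = comparison["splits_model_1"] - comparison["splits_model_2"]
print(comparison.sort_values("diff"))
```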

I suspect there are some regions of the feature space where it's possible to draw a split on either Longitude or Latitude that selects the exact same samples (there's a sketch of one way to check this at the end of this comment). You may have only observed this with what you called "params 2" because, in general, those parameters encourage LightGBM to grow more and deeper trees than it would by default.

more trees: `num_iterations=267` (the default is 100)

deeper trees: `num_leaves=128` (the default is 31) and `min_data_in_leaf=16` (the default is 20)
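Coming back to the suspected Latitude/Longitude tie: one way to check it concretely is to dump both models with Booster.trees_to_dataframe() and look for the first tree where their structures diverge. A sketch, again assuming model_1 and model_2 from the snippet above:

```python
# Compare the two models tree by tree; the first divergence should show where the
# tie between candidate splits was broken differently in the two models.
df_1 = model_1.trees_to_dataframe()
df_2 = model_2.trees_to_dataframe()

cols = ["node_index", "split_feature", "threshold", "count"]
for tree_idx in sorted(df_1["tree_index"].unique()):
    t_1 = df_1.loc[df_1["tree_index"] == tree_idx, cols].reset_index(drop=True)
    t_2 = df_2.loc[df_2["tree_index"] == tree_idx, cols].reset_index(drop=True)
    if not t_1.equals(t_2):
        print(f"first structural difference is in tree {tree_idx}")
        print(t_1)
        print(t_2)
        break
```

If the `count` values at the differing node are identical in both models, that is consistent with the two splits partitioning the samples in exactly the same way at that point.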