microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Different feature importances on different feature order even with deterministic params #6069

Open upwindowship opened 1 year ago

upwindowship commented 1 year ago

Description

We've run into an issue where identical input data produces different feature importances when the column order differs. This happens even with feature_fraction=1.0, deterministic=True, and force_row_wise=True, so it doesn't seem to be an issue of subsampling.

Reproducible example

Here is example data on which we were able to reproduce it: https://github.com/upgini/upgini/blob/add-lgbm-example/notebooks/lgbm_example_data.csv.zip

import lightgbm as lgb
import pandas as pd
import numpy as np

data = pd.read_csv("lgbm_example_data.csv.zip")
params = {'objective': 'huber', 'verbosity': -1, 'random_seed': 10, 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True}
train_columns = sorted([f for f in data.columns if f.startswith("f_")])

def train(x_train, columns):
    d_train = lgb.Dataset(x_train[columns], label=x_train["target"])
    bst = lgb.train(params, train_set=d_train)
    return dict(zip(bst.feature_name(), bst.feature_importance())) 

splits1 = train(data, train_columns)

rng = np.random.default_rng(42)
train_columns_shuffled = train_columns.copy() 
rng.shuffle(train_columns_shuffled)

splits2 = train(data, train_columns_shuffled)

for k in set(splits1).union(splits2):
    v1 = splits1.get(k)
    v2 = splits2.get(k)
    if v1 != v2:
        print(f"{k}: {v1} vs {v2}")

-----
f_139: 491 vs 485
f_110: 15 vs 5
f_188: 290 vs 23
f_189: 13 vs 296

This parameter set produces fewer variations, but the results still differ:

params = {'objective': 'huber', 'verbosity': -1, 'random_seed': 10, 'max_depth': 4, 'num_leaves': 16, 'max_cat_threshold': 80, 'min_data_per_group': 25, 'cat_l2': 10, 'cat_smooth': 12, 'num_boost_round': 100, 'learning_rate': 0.1, 'min_sum_hessian_in_leaf': 5, 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True}
---
f_188: 27 vs 14
f_189: 0 vs 13
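One plausible mechanism, offered here as an assumption rather than a confirmed diagnosis: if two candidate splits happen to have exactly equal gain, a strict argmax over features keeps whichever feature is scanned first, so column order alone decides the winner (and f_188/f_189 look like exactly such a tied pair). A toy sketch with invented gain values, not LightGBM internals:

```python
# Hypothetical illustration: with tied gains, a strict '>' comparison keeps
# the feature encountered first, so scan order picks the "winning" feature.
gains = {"f_188": 0.5, "f_189": 0.5, "f_110": 0.3}  # invented values

def best_feature(order):
    best, best_gain = None, float("-inf")
    for f in order:
        if gains[f] > best_gain:  # strict '>' keeps the earlier feature on ties
            best, best_gain = f, gains[f]
    return best

print(best_feature(["f_188", "f_189", "f_110"]))  # f_188
print(best_feature(["f_189", "f_188", "f_110"]))  # f_189
```

If this is what happens inside split finding, both trained models can be equally good while attributing the shared gain to different members of the tied pair.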

Environment info

LightGBM version or commit hash: Tested both on 3.3.5 and 4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

x86 build

Additional Comments

jameslamb commented 1 year ago

Thanks for using LightGBM.

Before we investigate this... please see the very similar discussions in this project's issue tracker:

Once you've read those, if you're certain none of the advice there applies to your situation, let us know and someone will take a closer look.

upwindowship commented 1 year ago

I've seen those issues before writing this one; sadly, none of them relates to our case. We care about feature importance stability here, not scores, because we use feature importance in our feature selection algorithm. The example data is not duplicated, it has 11k rows, and we don't use parameters that introduce randomness.
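To make concrete why this instability matters for importance-based selection, here is a minimal sketch using the importances from the first run above (the top_k function is illustrative, not our actual selection code): when two features swap importance between runs, a top-k cutoff selects a different feature set even though the models may be equivalent.

```python
# Illustrative top-k selection by importance; values taken from the
# reported output of the two column orderings above.
def top_k(importances, k):
    return sorted(importances, key=importances.get, reverse=True)[:k]

run1 = {"f_139": 491, "f_188": 290, "f_110": 15, "f_189": 13}
run2 = {"f_139": 485, "f_189": 296, "f_188": 23, "f_110": 5}

print(top_k(run1, 2))  # ['f_139', 'f_188']
print(top_k(run2, 2))  # ['f_139', 'f_189']
```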

upwindowship commented 1 year ago

@jameslamb Is there any news on this? Seems like this is a bug, at least from what I expect from the documentation.

jameslamb commented 1 year ago

Please don't leave "any updates on this?" types of comments in this project. If you're interested in investigating this and trying to find and fix the root cause, or if you have new information to add, we'd be grateful for the help.

Otherwise, being subscribed to the issue is sufficient guarantee that you'll be notified if something around it changes.