microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Error in 'interaction_constraints' parameter logic #5195

Open longdahl opened 2 years ago

longdahl commented 2 years ago

Description

Bug report

interaction_constraints* limits not only the immediately following split, but all further splits in the tree

Short preamble: This assumes that the behavior is intended to mirror 'feature interaction constraints' in XGBoost. I assume this is the case because the original feature request refers to the XGBoost version: https://github.com/microsoft/LightGBM/issues/2884. If the LightGBM implementation is not intended to follow the same logic, I should probably submit this as a feature request instead.

The error in detail: When you specify an interaction constraint in LightGBM 3.3.2, it prevents not only the immediately following split, but all future splits in the same tree. Specifically, given the constraints [[0,1],[1,2]], a tree that starts with a split on feature 0 cannot contain a split on feature 2 ANYWHERE further down the tree. Of course, a split on feature 2 directly after feature 0 should not be allowed, but a split on 0, followed by a split on 1, and then finally a split on feature 2 should be. Again, this assumes that the implementation follows the XGBoost documentation; see the final graph in the XGBoost documentation, which highlights that such pathways are legal: https://xgboost.readthedocs.io/en/stable/tutorials/feature_interaction_constraint.html

*https://lightgbm.readthedocs.io/en/latest/Parameters.html#interaction_constraints
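To make the expected semantics concrete, here is a minimal sketch of the accumulating "allowed set" rule as I read it from the XGBoost tutorial (my own illustration, not library code). Under this rule the pathway 0 → 1 → 2 is legal, while a split on 2 directly after 0 is not:

def path_is_legal(path, groups):
    # The root may split on any feature; after a split on feature f,
    # every constraint group containing f is added to the allowed set.
    allowed = None
    for f in path:
        if allowed is not None and f not in allowed:
            return False
        reachable = set().union(*[set(g) for g in groups if f in g])
        allowed = reachable if allowed is None else allowed | reachable
    return True

constraints = [[0, 1], [1, 2]]
print(path_is_legal([0, 1, 2], constraints))  # True  - 0 -> 1 -> 2 should be allowed
print(path_is_legal([0, 2], constraints))     # False - 2 directly after 0 is forbidden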

Reproducible example

import lightgbm as lgb
import pandas as pd
import numpy as np
import random

Creating a dummy dataset for the test

t = [x for x in range(0,100000)]
t2 = [x + (random.randrange(e+1) / 10) for e,x in enumerate(t)]
t3 = [x + (random.randrange(e+1) / 10) for e,x in enumerate(t)]
t4 = [x + (random.randrange(e+1) / 10) for e,x in enumerate(t)]
test_df = pd.DataFrame([t,t2,t3,t4]).T
test_df.columns = ["Outcome","f1","f2","f3"]
train_data = lgb.Dataset(test_df.drop("Outcome",axis=1), test_df[["Outcome"]], free_raw_data=False)

Training the model

interaction_constraints = [[0,1],[1,2]]
params = {'interaction_constraints' : interaction_constraints}
m_lgb = lgb.train(params,train_data)

Dumping model object and iterating over each tree


dump = m_lgb.dump_model()
for startnode_split_feature in range(0,3):
  for split_feature in range(0,3):
    had_split = 0  # reset before scanning the trees, so it is defined even if no tree matches
    for tree_index,tree_object in enumerate(dump['tree_info']):
      if tree_object['tree_structure']['split_feature'] == startnode_split_feature:
        if str(tree_object['tree_structure']).find(f"'split_feature': {split_feature}") != -1: # note that the tree structure contains all consecutive splits, not just the next one
          had_split = 1
          print(f"A tree with startnode feature {startnode_split_feature} also contained a split on feature {split_feature}")
          break
    if had_split == 0:
      print(f"No trees with startnode feature {startnode_split_feature} contained a split on feature {split_feature}")

Output of above code block:


A tree with startnode feature 0 also contained a split on feature 0
A tree with startnode feature 0 also contained a split on feature 1
No trees with startnode feature 0 contained a split on feature 2
A tree with startnode feature 1 also contained a split on feature 0
A tree with startnode feature 1 also contained a split on feature 1
A tree with startnode feature 1 also contained a split on feature 2
No trees with startnode feature 2 contained a split on feature 0
A tree with startnode feature 2 also contained a split on feature 1
A tree with startnode feature 2 also contained a split on feature 2

In summary, interaction_constraints limits not only the immediately following split, but all further splits in the tree.

While the above is of course anecdotal, it would be highly improbable that no such tree exists if only the immediately following split were constrained. I have also observed the same issue on other datasets.
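As a cross-check that avoids string matching on the dumped model, the same tally can be made with Booster.trees_to_dataframe() (assuming LightGBM >= 3.0, where this method was added; "f1"/"f2"/"f3" are the column names from the example above, and node_depth is 1 at the root):

df = m_lgb.trees_to_dataframe()
# Split feature of the root node of each tree
root_feature = df[df["node_depth"] == 1].set_index("tree_index")["split_feature"]
for start in ["f1", "f2", "f3"]:
    trees = root_feature[root_feature == start].index
    # All split features appearing anywhere in trees rooted on `start` (leaves are NaN)
    used = set(df[df["tree_index"].isin(trees)]["split_feature"].dropna())
    print(f"Trees with startnode feature {start} split on: {sorted(used)}")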

Environment info

LightGBM 3.3.2

mayer79 commented 2 years ago

The current implementation is how you would naturally define interaction constraints: each leaf node will see only variables from one variable group, i.e. variables that are allowed to interact. This generates a boosted trees model that respects the interaction constraints. The original XGBoost implementation did not stick to this rule. Following this issue https://github.com/dmlc/xgboost/issues/7115, XGBoost started to forbid overlapping feature sets in the constraints (as this is the only situation where their original implementation behaved strangely).
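To illustrate the rule (a sketch of the definition above, not the actual LightGBM implementation): the split features along any root-to-leaf branch must all belong to a single constraint group.

def branch_is_legal(path, groups):
    # All split features on a branch must be contained in one constraint group.
    return any(set(path) <= set(g) for g in groups)

constraints = [[0, 1], [1, 2]]
print(branch_is_legal([0, 1], constraints))     # True  - both features are in group [0, 1]
print(branch_is_legal([0, 1, 2], constraints))  # False - no single group contains 0, 1 and 2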

longdahl commented 2 years ago

Thanks for your reply. I thought the original XGBoost definition of interaction constraints might also have had some very interesting applications, but if it's in conflict with the definition of interaction constraints, it would of course be a whole different feature altogether.