microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

flaml.default.LGBMRegressor feature_name_ property problem #1168

Closed lizhuoq closed 1 year ago

lizhuoq commented 1 year ago

When I use flaml.default.LGBMRegressor, I found that the order of features changes. I am using it on my own dataset.

import flaml.default

model = flaml.default.LGBMRegressor()
model.fit(X_train, y_train)

print(model.feature_name_)
print(X_train.columns)
print(model.feature_name_ == X_train.columns)

print(f"train score : {model.score(X_train, y_train):.2f}")
print(f"test score : {model.score(X_test, y_test):.2f}")

output:

['landCover', 'total_precipitation', 'volumetric_soil_water_layer_2', 'leaf_area_index_low_vegetation', 'potential_evaporation', 'volumetric_soil_water_layer_1', 'leaf_area_index_high_vegetation', 'volumetric_soil_water_layer_3', '2m_temperature', 'total_evaporation', 'lai', 'dem', 'BD1_1', 'BD1_2', 'SAND1_1', 'SAND1_2', 'CLAY1_1', 'CLAY1_2', 'SILT1_1', 'SILT1_2', 'GRAV1_1', 'GRAV1_2']
Index(['total_precipitation', 'volumetric_soil_water_layer_2',
       'leaf_area_index_low_vegetation', 'potential_evaporation',
       'volumetric_soil_water_layer_1', 'leaf_area_index_high_vegetation',
       'volumetric_soil_water_layer_3', '2m_temperature', 'total_evaporation',
       'lai', 'dem', 'landCover', 'BD1_1', 'BD1_2', 'SAND1_1', 'SAND1_2',
       'CLAY1_1', 'CLAY1_2', 'SILT1_1', 'SILT1_2', 'GRAV1_1', 'GRAV1_2'],
      dtype='object')
[False False False False False False False False False False False False
  True  True  True  True  True  True  True  True  True  True]
train score : 0.88
test score : 0.86

But when I use lightgbm.LGBMRegressor, everything is in order!

# lightgbm
import lightgbm

model = lightgbm.LGBMRegressor()

model.fit(X_train, y_train)

print(model.feature_name_)
print(X_train.columns)
print(model.feature_name_ == X_train.columns)

print(f"train score : {model.score(X_train, y_train):.2f}")
print(f"test score : {model.score(X_test, y_test):.2f}")

output

['total_precipitation', 'volumetric_soil_water_layer_2', 'leaf_area_index_low_vegetation', 'potential_evaporation', 'volumetric_soil_water_layer_1', 'leaf_area_index_high_vegetation', 'volumetric_soil_water_layer_3', '2m_temperature', 'total_evaporation', 'lai', 'dem', 'landCover', 'BD1_1', 'BD1_2', 'SAND1_1', 'SAND1_2', 'CLAY1_1', 'CLAY1_2', 'SILT1_1', 'SILT1_2', 'GRAV1_1', 'GRAV1_2']
Index(['total_precipitation', 'volumetric_soil_water_layer_2',
       'leaf_area_index_low_vegetation', 'potential_evaporation',
       'volumetric_soil_water_layer_1', 'leaf_area_index_high_vegetation',
       'volumetric_soil_water_layer_3', '2m_temperature', 'total_evaporation',
       'lai', 'dem', 'landCover', 'BD1_1', 'BD1_2', 'SAND1_1', 'SAND1_2',
       'CLAY1_1', 'CLAY1_2', 'SILT1_1', 'SILT1_2', 'GRAV1_1', 'GRAV1_2'],
      dtype='object')
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True]
train score : 0.86
test score : 0.85

I don't know what's going on. Can you help me sort it out?
Thanks!

lizhuoq commented 1 year ago

My flaml version is 1.2.4; my lightgbm version is 3.3.5.

sonichi commented 1 year ago

It's because the categorical and numeric features are reordered and grouped together by data preprocessing.
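Setting FLAML's actual preprocessing code aside, the effect described here can be reproduced with plain pandas: if a preprocessor selects the categorical columns and the numeric columns separately and concatenates them, the categorical features end up grouped at the front, just as `landCover` did in the output above. A minimal sketch with toy data (the column names mimic the issue; the grouping logic is an illustration, not FLAML's implementation):

```python
import pandas as pd

# Toy frame in the user's original column order: numeric first, categorical last.
X = pd.DataFrame(
    {
        "total_precipitation": [1.2, 0.4, 0.9],
        "dem": [310.0, 25.0, 80.0],
        "landCover": ["forest", "crop", "urban"],
    }
)
X["landCover"] = X["landCover"].astype("category")

# Hypothetical preprocessing step: split columns by dtype, then recombine,
# which places all categorical columns before the numeric ones.
cat_cols = X.select_dtypes(include="category").columns.tolist()
num_cols = X.select_dtypes(exclude="category").columns.tolist()
X_grouped = X[cat_cols + num_cols]

print(X.columns.tolist())          # ['total_precipitation', 'dem', 'landCover']
print(X_grouped.columns.tolist())  # ['landCover', 'total_precipitation', 'dem']
```

A model fitted on `X_grouped` would then report `feature_name_` in the grouped order, not the order of the original `X.columns`.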

lizhuoq commented 1 year ago

> It's because the categorical and numeric features are reordered and grouped together by data preprocessing.

Thank you, this is very helpful. The 'landCover' variable is indeed a categorical variable.

The order of the feature importances also seems to follow the feature order after grouping, so `feature_name_` and `feature_importances_` remain in one-to-one correspondence.
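Given that correspondence, the importances can be aligned back to the original column order by indexing them with the model's `feature_name_` and reindexing to `X.columns`. A sketch with hypothetical values standing in for a fitted model's attributes:

```python
import pandas as pd

# Hypothetical stand-ins for attributes of a fitted flaml.default.LGBMRegressor:
# names are in the reordered (grouped) order, importances match that order.
feature_name_ = ["landCover", "total_precipitation", "dem"]
feature_importances_ = [120, 300, 45]
original_columns = ["total_precipitation", "dem", "landCover"]  # X.columns order

# Index importances by the model's own feature order, then reindex to the
# DataFrame's original column order.
importances = pd.Series(feature_importances_, index=feature_name_)
importances = importances.reindex(original_columns)
print(importances)
```

With a real model, `feature_name_` and `feature_importances_` would come from the fitted estimator and `original_columns` from `X.columns.tolist()`.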

However, calling flaml.default.LGBMRegressor in a sklearn-style workflow may cause ambiguity, since the order of `feature_name_` may not match the order of `X.columns`.
I suggest the following when using flaml.default.LGBMRegressor, to avoid errors later when passing the model to other Python packages for analysis.

import flaml.default
import lightgbm

model = flaml.default.LGBMRegressor()
model.fit(X, y)
if model.feature_name_ != X.columns.tolist():
    # Preprocessing reordered the features: refit a plain LGBMRegressor with
    # the tuned hyperparameters so the feature order matches X.columns.
    model = lightgbm.LGBMRegressor(**model.get_params())
    model.fit(X, y)
    assert model.feature_name_ == X.columns.tolist()

sonichi commented 1 year ago

Thanks for the suggestion. One solution is to disable data preprocessing in default.LGBMRegressor. Or make that configurable.