nubank / fklearn

fklearn: Functional Machine Learning
Apache License 2.0

Adds support to specify categorical features in lgbm learner #197

Open fberanizo opened 2 years ago

fberanizo commented 2 years ago

Status

READY

Todo list

Background context

Older versions of LightGBM allowed passing categorical_feature through the params argument, but recent versions raise the following warning:

UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))

It is unclear to me whether the warning is misleading and the option still works. A LightGBM contributor said that the correct way to pass categorical features is through the Dataset object.

Description of the changes proposed in the pull request

Adds a new option categorical_features to lgbm_classification_learner. It accepts a list of column names that should be treated as categorical features. Further instructions are found in https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
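A minimal sketch of the mechanics this option relies on, using pandas alone (the frame and column names are hypothetical; the lgbm.Dataset call mentioned in the comments is only referenced, not executed):

```python
import pandas as pd

# Hypothetical training frame. LightGBM can treat a pandas 'category'
# column as categorical, either via categorical_feature='auto' or by
# naming it explicitly in the Dataset constructor.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "x": [0.1, 0.2, 0.3, 0.4],
})
df["color"] = df["color"].astype("category")

# With this PR one would pass categorical_features=["color"] to
# lgbm_classification_learner, which would forward it as
# lgbm.Dataset(df[features], ..., categorical_feature=["color"]).
assert str(df["color"].dtype) == "category"
```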

codecov-commenter commented 2 years ago

Codecov Report

Merging #197 (bf4dc15) into master (aeaa36c) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #197   +/-   ##
=======================================
  Coverage   94.24%   94.24%           
=======================================
  Files          32       32           
  Lines        1928     1928           
  Branches      258      258           
=======================================
  Hits         1817     1817           
  Misses         76       76           
  Partials       35       35           
Impacted Files Coverage Δ
src/fklearn/training/classification.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update aeaa36c...bf4dc15.

fberanizo commented 2 years ago

I'm worried that the change below might alter the behavior of existing models (due to the use of pandas + categorical_feature='auto' setting).
Is this still a reasonable change?

-  dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                          silent=True)
+    dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                          silent=True, categorical_feature=categorical_features)
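The concern above can be seen with pandas alone: calling .values on a frame that holds a categorical column collapses everything to a single object ndarray, losing the category dtype that categorical_feature='auto' relies on (a minimal sketch with hypothetical columns, no LightGBM required):

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 2.0, 3.0],
    "cat": pd.Categorical(["a", "b", "a"]),
})

# Selecting columns as a DataFrame preserves the category dtype,
# which categorical_feature='auto' depends on.
assert str(df["cat"].dtype) == "category"

# .values coerces the mixed frame into one object ndarray,
# silently dropping the dtype information.
arr = df[["num", "cat"]].values
assert arr.dtype == object
```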
jmoralez commented 2 years ago

I'd say this change is for the better because it matches the behavior of using LightGBM directly. However, taking a closer look at the code, I see that the values attribute is used when predicting as well:
https://github.com/nubank/fklearn/blob/a3de09a6949089f47633e64b536a808650d7e9d1/src/fklearn/training/regression.py#L487
https://github.com/nubank/fklearn/blob/a3de09a6949089f47633e64b536a808650d7e9d1/src/fklearn/training/classification.py#L70
So it'll definitely cause some headaches if we don't change it there as well. Yet for SHAP the dataframe is used:
https://github.com/nubank/fklearn/blob/a3de09a6949089f47633e64b536a808650d7e9d1/src/fklearn/training/regression.py#L492
I think it'd be best to use the dataframe everywhere so as not to surprise the user, and using values isn't always more efficient than the dataframe. Also, the dataframe allows using the categorical features in their "raw" form, i.e. if we leave the .values there, the user will always have to convert them to integer codes.
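The last point, that keeping .values forces users to integer-encode categoricals themselves, can be sketched with pandas' cat.codes (hypothetical data, pandas only):

```python
import pandas as pd

# Raw string categories, as a user would naturally store them.
s = pd.Series(["low", "high", "low"], dtype="category")

# If fklearn kept feeding .values to LightGBM, users would have to
# pre-encode like this; passing the dataframe instead lets LightGBM
# consume the raw categories directly.
codes = s.cat.codes  # categories sorted: ['high', 'low'] -> high=0, low=1
assert list(codes) == [1, 0, 1]
```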

tiagorm commented 2 years ago

I'm worried that the change below might alter the behavior of existing models (due to the use of pandas + categorical_feature='auto' setting). Is this still a reasonable change?

-  dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                          silent=True)
+    dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                          silent=True, categorical_feature=categorical_features)

Agree with the above comments; the change itself looks good, but it's better to review the .values usages first.

fberanizo commented 2 years ago

Hi guys! Sorry for the (very) late response. Recently, @isphus1973 and @hellenlima reached out to me asking about this feature, and I finally found some time to finish it.

The recent commits: