Open · fberanizo opened 2 years ago
Merging #197 (bf4dc15) into master (aeaa36c) will not change coverage. The diff coverage is 100.00%.
@@           Coverage Diff           @@
##           master     #197   +/-   ##
=======================================
  Coverage   94.24%   94.24%
=======================================
  Files          32       32
  Lines        1928     1928
  Branches      258      258
=======================================
  Hits         1817     1817
  Misses         76       76
  Partials       35       35
Impacted Files | Coverage Δ |
---|---|
src/fklearn/training/classification.py | 100.00% <100.00%> (ø) |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aeaa36c...bf4dc15.
I'm worried that the change below might alter the behavior of existing models (due to the use of pandas together with the categorical_feature='auto' setting). Is this still a reasonable change?
- dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
- silent=True)
+ dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+ silent=True, categorical_feature=categorical_features)
I'd say this change is for the better because it matches the behavior of using LightGBM directly. However, taking a closer look at the code, I see that the .values attribute is also used when predicting:
https://github.com/nubank/fklearn/blob/a3de09a6949089f47633e64b536a808650d7e9d1/src/fklearn/training/regression.py#L487
https://github.com/nubank/fklearn/blob/a3de09a6949089f47633e64b536a808650d7e9d1/src/fklearn/training/classification.py#L70
So it will definitely cause some headaches if we don't change it there as well. Yet for SHAP the dataframe is used:
https://github.com/nubank/fklearn/blob/a3de09a6949089f47633e64b536a808650d7e9d1/src/fklearn/training/regression.py#L492
I think it'd be best to use the dataframe everywhere so we don't surprise the user, and using .values isn't always more efficient than the dataframe anyway. The dataframe also allows categorical features to be used in their "raw" form; if we leave the .values there, the user will always have to convert them to integer codes.
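A minimal sketch of the concern, outside the PR itself (the columns num, cat and y and all the data are made up for illustration): train on a DataFrame that carries a pandas category column, then compare predicting on the DataFrame with predicting on .values.

```python
import lightgbm as lgbm
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num": rng.random(200),
    "cat": pd.Series(rng.choice(list("abc"), 200), dtype="category"),
    "y": rng.integers(0, 2, 200),
})
features = ["num", "cat"]

# Training on the raw DataFrame lets LightGBM treat "cat" as categorical.
dtrain = lgbm.Dataset(df[features], label=df["y"], categorical_feature=["cat"])
booster = lgbm.train({"objective": "binary", "verbose": -1}, dtrain, num_boost_round=10)

# Predicting on the same representation is consistent with training...
preds = booster.predict(df[features])

# ...whereas df[features].values is an object ndarray here: the category
# information is gone, so the call fails or misinterprets the column.
# preds_np = booster.predict(df[features].values)
```

This is only meant to illustrate why training and prediction should use the same representation once categorical columns are involved.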
Agree with the above comments. The change itself looks good, but it's better to review the .values usages as well.
Hi guys! Sorry for the (very) late response. Recently, @isphus1973 and @hellenlima reached out to me asking about this feature, and I finally spared some time to finish it.
The recent commits touch lgbm_regression_learner as well and replace df.values with df, as you suggested.
Status
READY
Todo list
Background context
Older versions of LightGBM used to allow passing categorical_feature through the params argument, but recent versions raise a warning when it is passed that way. It is unclear (to me) whether the warning is misleading and the option still works. A LightGBM contributor said that the correct way to pass categorical features is through the Dataset object.
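For reference, a small sketch of the two ways categorical features can be declared when using LightGBM directly (the column names are illustrative; the params-dict style is the one recent versions warn about, so it is left commented out):

```python
import lightgbm as lgbm
import pandas as pd

df = pd.DataFrame({
    "num": [0.1, 0.7, 0.3, 0.9],
    "cat": pd.Series(["a", "b", "a", "c"], dtype="category"),
    "y": [0, 1, 0, 1],
})

# Older style: declare the categorical columns inside the params dict
# (recent LightGBM versions warn that the Dataset setting overrides this).
# params = {"objective": "binary", "categorical_feature": "name:cat"}

# Recommended style: attach the information to the Dataset object itself.
dtrain = lgbm.Dataset(df[["num", "cat"]], label=df["y"], categorical_feature=["cat"])
```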
Description of the changes proposed in the pull request
Adds a new option categorical_features to lgbm_classification_learner. It accepts a list of column names that should be treated as categorical features. Further instructions can be found in https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
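A hypothetical usage sketch of the new option, assuming the learner keeps fklearn's usual (predict_fn, scored_df, logs) return and that the argument is named categorical_features as described above; treat it as illustrative rather than the final API:

```python
import pandas as pd
from fklearn.training.classification import lgbm_classification_learner

df = pd.DataFrame({
    "num": [0.1, 0.5, 0.3, 0.9, 0.2, 0.8],
    "cat": pd.Series(list("ababab"), dtype="category"),
    "target": [0, 1, 0, 1, 0, 1],
})

# categorical_features is the option added by this PR; the other arguments
# follow the existing lgbm_classification_learner interface.
predict_fn, scored_df, logs = lgbm_classification_learner(
    df,
    features=["num", "cat"],
    target="target",
    categorical_features=["cat"],
)

new_predictions = predict_fn(df)  # apply the trained model to a dataframe
```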