topepo / FES

Code and Resources for "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Kuhn and Johnson
https://bookdown.org/max/FES
GNU General Public License v2.0

Request for additional discussion in 5.6 Factors versus Dummy Variables in Tree-Based Models #22

Closed kransom14 closed 5 years ago

kransom14 commented 6 years ago

Reference to the version dated 2018-05-12. I am hoping you could add some additional discussion to Section 5.6, Factors versus Dummy Variables in Tree-Based Models, regarding variable importance. My team and I have been working with our categorical data both ways (dummy encodings and factors), and we noticed that dummy-encoded variables usually land near the bottom of the importance ranking, while the same variables, used directly as factors, are sometimes ranked as the most important. This matters to us because we use these models for inference. We are using the gbm and randomForest packages along with caret for CV tuning.

topepo commented 6 years ago

That's a good point and makes sense. The factor version carries the importance for all of the factor levels. I'll work up an example and add it to the section.

topepo commented 5 years ago

Thanks. I added this to the end of that section:

One other effect of how qualitative predictors are encoded is related to summary measures. Many of these techniques, especially tree-based models, calculate variable importance scores that are relative measures of how much a predictor affected the outcome. For example, trees measure the effect of a specific split on the improvement in model performance (e.g., impurity or residual error). As predictors are used in splits, these improvements are aggregated and can be used as the importance scores. If a split involves all of the predictor's values (e.g., Saturday versus the other six days), the importance score for the entire variable is likely to be much larger than the analogous score for an individual level (e.g., Saturday versus not-Saturday). In the latter case, the fragmented scores for the separate levels may not be ranked as highly as the single score that reflects all of the levels. A similar issue comes up during feature selection (Chapters X through X) and interaction detection (Chapter X). The choice of predictor encoding methods is discussed further there.
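A small simulation can make the fragmentation effect concrete. This is a sketch of my own (not from the book), and it uses scikit-learn rather than the R gbm/randomForest packages discussed above; the data, column names, and effect sizes are all invented. One caveat on the comparison: integer-coding a factor treats its levels as ordered, which is not identical to R's native factor handling in trees, but it still lets a single split use many levels at once, so the importance stays attached to one column instead of being spread across seven dummies.

```python
# Sketch: how one-hot (dummy) encoding can fragment a categorical
# predictor's importance score in a tree ensemble.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 2000
day = rng.integers(0, 7, size=n)       # hypothetical day-of-week predictor (0..6)
noise = rng.normal(size=n)             # an unrelated numeric predictor

# Outcome rate varies across several levels, so the factor's signal
# is spread over multiple levels rather than one "Saturday" split.
y = (rng.random(n) < 0.15 + 0.10 * day).astype(int)

# Encoding 1: one integer-coded column; splits can separate many levels
# at once, so the whole factor's importance accrues to a single column.
X_factor = np.column_stack([day, noise])
rf_factor = RandomForestClassifier(n_estimators=100, random_state=0)
rf_factor.fit(X_factor, y)

# Encoding 2: seven dummy columns plus the noise column; each dummy can
# only isolate one level, so the importance is divided among them.
X_dummy = np.column_stack([(day == k).astype(float) for k in range(7)] + [noise])
rf_dummy = RandomForestClassifier(n_estimators=100, random_state=0)
rf_dummy.fit(X_dummy, y)

print("factor-encoded importance: ", rf_factor.feature_importances_[0])
print("largest single-dummy importance:", rf_dummy.feature_importances_[:7].max())
print("summed dummy importances:       ", rf_dummy.feature_importances_[:7].sum())
```

With this setup, the single factor-style column scores far higher than any one dummy column, even though the dummies collectively carry the same information, which mirrors the ranking behavior kransom14 describes.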