kransom14 closed this issue 5 years ago
That's a good point and makes sense. The factor version carries the importance for all of the factor levels. I'll work up an example and add it to the section.
Thanks. I added this to the end of that section:
One other effect of how qualitative predictors are encoded is related to summary measures. Many of these techniques, especially tree-based models, calculate variable importance scores that are relative measures of how much a predictor affected the outcome. For example, trees measure the effect of a specific split on the improvement in model performance (e.g. impurity, residual error, etc.). As predictors are used in splits, these improvements are aggregated; these can be used as the importance scores. If a split involves all of the predictor's values (e.g. Saturday versus the other six days), the importance score for the entire variable is likely to be much larger than a similar importance score for an individual level (e.g. Saturday or not-Saturday). In the latter case, these fragmented scores for each level may not be ranked as highly as the analogous score that reflects all of the levels. A similar issue comes up during feature selection (Chapters X through X) and interaction detection (Chapter X). The choice of predictor encoding methods is discussed further there.
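To make the aggregation argument concrete, here is a small Python sketch (not from the book, which uses R) that computes the Gini impurity decrease for a day-of-week predictor. The data, the "busy on weekends" outcome, and the split choices are all hypothetical. A split that can use the whole factor (grouping Saturday and Sunday together) earns a larger improvement than either single-level dummy split, so the per-level scores come out fragmented:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(labels, mask):
    """Weighted Gini decrease for the binary split defined by `mask`."""
    n = len(labels)
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# Toy data: the outcome is True only on weekends.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 10
outcome = [d in ("Sat", "Sun") for d in days]

# Factor encoding: the tree can put any subset of levels on one side.
factor_score = impurity_decrease(outcome, [d in ("Sat", "Sun") for d in days])

# Dummy encoding: each level is a separate 0/1 column, scored on its own.
sat_score = impurity_decrease(outcome, [d == "Sat" for d in days])
sun_score = impurity_decrease(outcome, [d == "Sun" for d in days])

print(f"factor split:  {factor_score:.3f}")
print(f"Sat dummy:     {sat_score:.3f}")
print(f"Sun dummy:     {sun_score:.3f}")
```

On this toy data the single factor split separates the classes perfectly and captures the full parent impurity, while each dummy split captures well under half of it; even the two dummy scores added together fall short of the one factor score, which is why the dummy columns tend to sink in an importance ranking.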
Reference to version dated 2018-05-12. I am hoping you could add some additional discussion to section 5.6, Factors versus Dummy Variables in Tree-Based Models, regarding variable importance. My team and I have been working with our categorical data both ways (dummy encodings and factors), and we noticed that the relative importance of dummy-encoded variables is usually at the bottom of the importance ranking, while when we use the factors directly they are sometimes ranked as the most important. This has mattered for us when drawing inference from these models. We are using the `gbm` and `randomForest` packages along with `caret` for CV tuning.