Variable importance for nominal variables with few categories

In theory, variable importance assessment based on a grid of counterfactual shift values should be sound for nominal variables. In practice, however, such variables have few unique values, even after conversion via as.numeric. This triggers a downstream bug: sl3's Variable_Type classifies them as categorical rather than continuous. The bug is non-trivial to track down and can be frustrating for users. A simple but naive workaround is to add mean-zero noise to the variable so that it appears to have more than 20 or so unique values, which is sufficient to make sl3 treat it as continuous. For example, the variable u below has only 4 (ordered) categories and will therefore be recognized as categorical:

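n <- 10000
# draw uniform indices and map them to 4 categories with unequal probabilities
u_idx <- runif(n)
u <- rep(NA_character_, n)
u[u_idx <= 0.1] <- "A"
u[u_idx > 0.1 & u_idx <= 0.3] <- "B"
u[u_idx > 0.3 & u_idx <= 0.95] <- "C"
u[u_idx > 0.95] <- "D"
# convert labels to integer codes 1..4 (still only 4 unique values)
u <- as.numeric(as.factor(u))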

To have it recognized as continuous, one could implement

u <- u + runif(n, -0.001, 0.001)

which gives u far more unique values than the original while leaving it unchanged in expectation.
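The workaround can be wrapped in a small helper and sanity-checked in base R. The following is only a sketch: jitter_if_few_levels, min_unique and width are hypothetical names introduced here for illustration (they are not part of sl3 or tmle3shift), and the default of 20 simply mirrors the "20 or so" figure mentioned above rather than a confirmed sl3 setting.

```r
# Hypothetical helper (not part of sl3 or tmle3shift): add small mean-zero
# uniform noise when a numeric variable has too few unique values.
jitter_if_few_levels <- function(x, min_unique = 20, width = 0.001) {
  stopifnot(is.numeric(x), width > 0, width < 0.5)
  if (length(unique(x)) > min_unique) {
    return(x)  # already has enough distinct values; leave untouched
  }
  x + runif(length(x), -width, width)
}

set.seed(1)
n <- 10000
# integer codes 1..4 with the same category probabilities as the example above
u <- as.numeric(cut(runif(n), breaks = c(0, 0.1, 0.3, 0.95, 1),
                    include.lowest = TRUE))
u_jit <- jitter_if_few_levels(u)

length(unique(u))       # distinct values before jittering
length(unique(u_jit))   # distinct values after jittering
all(round(u_jit) == u)  # rounding recovers the original codes
```

Keeping width well below 0.5 means the original category can always be recovered with round(), so the noise changes how the variable is typed without blurring which category each observation belongs to.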

tlverse / tmle3shift

Variable importance for nominal variables with few categories #24