Open laura-vangalen opened 1 month ago
This looks like a bug. Thank you for reporting it!
Why is this happening?
We show the values of each column using a single shared color scheme. To make the coloring more robust to outliers, we use the quantiles of the points instead of their actual values. This should be relatively robust for continuous values (a single outlier won't squash all the other points into one color). Another advantage is that values can be loosely compared across columns: the same quantile gets the same color regardless of the actual value.
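To illustrate the explanation above, here is a minimal base R sketch (not the plotting code itself) comparing quantile-based scaling via `stats::ecdf` with plain min-max scaling. The vectors are made-up toy data. It shows both why quantiles cope well with an outlier and why a balanced 0/1 column ends up at 0.5 and 1 rather than 0 and 1.

```r
# Toy continuous column with one extreme outlier
x <- c(1, 2, 3, 4, 1000)

# Plain min-max scaling: the outlier pushes every other point close to 0
x_minmax <- (x - min(x)) / (max(x) - min(x))

# Quantile scaling: each point is replaced by its empirical CDF value,
# so the points stay evenly spread between 0 and 1
x_quantile <- stats::ecdf(x)(x)

# Toy binary column with two 0s and two 1s
binary <- c(0, 0, 1, 1)

# The empirical CDF at 0 is the share of values <= 0 (here 2/4),
# so the lower value maps to 0.5 instead of 0
binary_quantile <- stats::ecdf(binary)(binary)

print(x_minmax)
print(x_quantile)
print(binary_quantile)
```

For a binary column the lower value maps to the fraction of zeros in the data, so it only lands near 0.5 when the two classes are roughly balanced.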
How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?
I would suggest using factors as the models might benefit from the information that the column contains discrete values.
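As a rough sketch of the factor suggestion, binary numeric columns could be converted in a plain R data frame before the data is uploaded to H2O. The data frame and its column names below are illustrative toy data, and the 0/1 detection rule is an assumption about what counts as binary.

```r
# Toy data frame; column names and values are illustrative only
df <- data.frame(
  BinaryVar1 = c(0, 1, 1, 0),
  BinaryVar2 = c(1, 0, 1, 1),
  Continuous = c(0.3, 2.7, 1.1, 5.4)
)

# Flag numeric columns whose non-missing values are all 0 or 1
is_binary <- vapply(
  df,
  function(col) is.numeric(col) && all(stats::na.omit(col) %in% c(0, 1)),
  logical(1)
)

# Convert only the flagged columns to factors; other columns are untouched
df[is_binary] <- lapply(df[is_binary], as.factor)

str(df)
# The converted frame could then be uploaded with as.h2o(df), which needs a
# running H2O cluster and is therefore not executed in this sketch.
```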
But if you want to change how the values are normalized, you can use the following code. I changed it so that it doesn't use quantiles for columns with fewer than 32 unique values.
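.uniformize <- function(col) {
  # Factors: rescale the level indices with h2o's internal .min_max helper
  if (is.factor(col)) {
    return(.min_max(as.numeric(col) / nlevels(col)))
  }
  if (is.character(col) || all(is.na(col))) {
    # Character columns with some non-missing values are treated like factors
    if (is.character(col) && !all(is.na(col))) {
      fct <- as.factor(col)
      return(.min_max(as.numeric(fct) / nlevels(fct)))
    }
    # Columns that are entirely NA map to 0
    return(rep_len(0, length(col)))
  }
  res <- col
  # don't uniformize for low number of unique values
  if (length(unique(col)) >= 32) {
    res <- stats::ecdf(col)(col)
  }
  res[is.na(res)] <- 0
  return(res)
}
# Replace the function inside the loaded h2o namespace for this R session
assignInNamespace(".uniformize", .uniformize, "h2o")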
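To check what the 32-unique-value threshold does without touching the h2o namespace, the numeric branch can be tried on its own. This is a minimal sketch: `uniformize_numeric_sketch` is a hypothetical standalone copy of just that branch, not an h2o function, and the input vectors are toy data.

```r
# Hypothetical standalone copy of the numeric branch of the patched function
uniformize_numeric_sketch <- function(col, min_unique = 32) {
  res <- col
  # Only switch to quantiles when the column has enough distinct values
  if (length(unique(col)) >= min_unique) {
    res <- stats::ecdf(col)(col)
  }
  res[is.na(res)] <- 0
  res
}

binary <- c(0, 1, 1, 0, 1)
continuous <- seq(10, 1000, length.out = 100)

# Two unique values: returned unchanged, so 0 stays 0 and 1 stays 1
binary_out <- uniformize_numeric_sketch(binary)

# 100 unique values: replaced by their empirical CDF values
continuous_out <- uniformize_numeric_sketch(continuous)

print(binary_out)
print(range(continuous_out))
```

Columns below the threshold are passed through as raw values in this branch, so a column coded as 1/2 instead of 0/1 would keep those values here. Whether the plot rescales them afterwards is not shown in this thread.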
Thanks for your quick response. Another hack I found to change how the values are normalized was to add a fake character variable. This variable then got automatically deleted when running the model, but the normalizing still worked in the way I wanted. But thanks for the code, that is much better.
H2O version, Operating System and Environment: Windows 10, R version 4.4.1, h2o R package version 3.44.0.3
Actual behavior: When I use h2o.shap_summary_plot to plot the output of a random forest model with only numeric variables, the normalized values of the binary (0/1) numeric variables are not 0 and 1; they come out as ~0.5 and 1 (and so are plotted as purple and pink rather than blue and pink). But if I include a factor variable in my model, they are normalized to 0 and 1 (and are plotted as blue and pink). Numeric variables seem to be normalized differently depending on whether there are factor variables in the model.
Expected behavior: I would expect the binary variables to be treated the same regardless of what other variables are in the model.
Steps to reproduce
Screenshots: "p1" plot - only numeric variables in the model. Neither binary variable is normalized to 0 and 1; the values are more like 0.5 and 1
"p2" plot - "BinaryVar2" has been changed to a factor. Now the remaining numeric binary variable is normalized to 0 and 1
Why is this happening? How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?