h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

h2o.shap_summary_plot feature data normalization issue with binary variables #16407

Open laura-vangalen opened 1 month ago

laura-vangalen commented 1 month ago

H2O version, Operating System and Environment Windows 10. R version 4.4.1. h2o R package version 3.44.0.3

Actual behavior When I use h2o.shap_summary_plot to plot the output of a random forest model with only numeric variables, the normalized values of the numeric variables that are binary (0, 1) are not 0 and 1; they come out as ~0.5 and 1 (and so are plotted as purple and pink rather than blue and pink). But if I include a factor variable in my model, they get normalized to 0 and 1 (and are plotted as blue and pink). Numeric variables seem to be normalized differently depending on whether there are factor variables in the model.

Expected behavior I would expect the binary variables to be treated the same regardless of what other variables are in the model

Steps to reproduce

library(h2o)
h2o.init()
example <- data.frame(
  NumericVar = rnorm(100, mean = 50, sd = 10), # Numeric variable (normal distribution)
  BinaryVar = sample(c(0, 1), 100, replace = TRUE), # Binary variable (0, 1)
  BinaryVar2 = sample(c(0, 1), 100, replace = TRUE) # Binary variable (0, 1)
)
example$CorrelatedVar = example$NumericVar * 0.8 + rnorm(100, mean = 0, sd = 5)  # Add response variable

# run and plot model that contains numeric values only. The binary numeric variables don't get normalized to 0 and 1.
regressionMatrix <- as.h2o(example)
rfModel <- h2o.randomForest(training_frame = regressionMatrix,
                            y = "CorrelatedVar",
                            ntrees = 500,
                            mtries = 3,
                            sample_rate = 0.632,
                            min_rows = 2,
                            seed = 42,
                            max_depth = 20)
p1=h2o.shap_summary_plot(
  model = rfModel,
  newdata = regressionMatrix
)
p1

# change one variable to a factor, then run and plot the model. The remaining binary numeric variable does get normalized to 0 and 1
example$BinaryVar2=as.factor(example$BinaryVar2) # change one of the binary variables to a factor

regressionMatrix <- as.h2o(example)
rfModel <- h2o.randomForest(training_frame = regressionMatrix,
                            y = "CorrelatedVar",
                            ntrees = 500,
                            mtries = 3,
                            sample_rate = 0.632,
                            min_rows = 2,
                            seed = 42,
                            max_depth = 20)
p2=h2o.shap_summary_plot(
  model = rfModel,
  newdata = regressionMatrix
)
p2

Screenshots "p1" plot - only numeric variables in the model. Neither binary variable is normalized to 0 and 1; both come out closer to 0.5 and 1

image

"p2" plot - "BinaryVar2" has been changed to a factor. Now the remaining numeric binary variable is normalized to 0 and 1

image

Why is this happening? How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?

tomasfryda commented 1 month ago

This looks like a bug. Thank you for reporting it!

Why is this happening?

We try to show the values of individual columns using one color scheme, and to make it more robust to outliers we use quantiles of the points instead of their actual values. This should be relatively robust for continuous values (an outlier won't leave the remaining points in just one color). Another advantage is that you can, to some extent, compare values between multiple columns: the same quantile will have the same color regardless of the actual value.
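
To illustrate why a binary numeric column lands at roughly 0.5 and 1 under quantile-based coloring, here is a minimal sketch using stats::ecdf, the same base R function that appears in the numeric branch of the code posted in this thread. The vectors are made up for illustration, and the sketch does not reproduce the plotting code itself.

```r
# Empirical CDF of a balanced 0/1 column: ecdf(x)(v) is the share of values <= v.
binary <- c(0, 1, 0, 1, 0, 1, 0, 1)
quantiles <- stats::ecdf(binary)(binary)
print(unique(quantiles))  # 0.5 1.0

# With more zeros than ones, the lower value moves further from 0.
skewed <- c(0, 0, 0, 1)
skewed_quantiles <- stats::ecdf(skewed)(skewed)
print(unique(skewed_quantiles))  # 0.75 1.00
```

The smallest value of a column never maps to 0 under an empirical CDF, because at least one observation is always less than or equal to it. For continuous data that lower bound is close to 0, but for a 0/1 column it equals the share of zeros.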

How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?

I would suggest using factors as the models might benefit from the information that the column contains discrete values.
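
As a rough sketch of that suggestion, the 0/1 columns can be converted to factors in plain R before the data frame is passed to as.h2o. The helper name is_binary and the small data frame are hypothetical and exist only for this example; the response column should be excluded if it happens to be 0/1 and the model is meant to be a regression.

```r
# Toy data with the same column names as the reproduction script.
example <- data.frame(
  NumericVar = c(41.2, 55.0, 63.7, 48.9),
  BinaryVar  = c(0, 1, 1, 0),
  BinaryVar2 = c(1, 1, 0, 0)
)

# Hypothetical helper: TRUE for numeric columns that contain only 0, 1, or NA.
is_binary <- function(x) is.numeric(x) && all(x %in% c(0, 1, NA))

# Find the binary columns and convert each one to a factor.
binary_cols <- names(example)[vapply(example, is_binary, logical(1))]
example[binary_cols] <- lapply(example[binary_cols], as.factor)

str(example)
```

After this step the frame can be uploaded with as.h2o(example) as in the reproduction script, and the converted columns are treated as categorical.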

But if you want to change how the values are normalized, you can use the following code. I changed the code so that it doesn't use quantiles for columns with fewer than 32 unique values.

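# Patched copy of h2o's internal .uniformize(), which prepares column values for coloring.
.uniformize <- function(col) {
  # Factors: scale the level codes by the number of levels, then min-max them.
  if (is.factor(col)) {
    return(.min_max(as.numeric(col) / nlevels(col)))
  }
  if (is.character(col) || all(is.na(col))) {
    # Character columns are handled like factors; all-NA columns become 0.
    if (is.character(col) && !all(is.na(col))) {
      fct <- as.factor(col)
      return(.min_max(as.numeric(fct) / nlevels(fct)))
    }
    return(rep_len(0, length(col)))
  }
  res <- col
  # Only use quantiles (ECDF) when the column has at least 32 unique values.
  if (length(unique(col)) >= 32) {
    res <- stats::ecdf(col)(col)
  }
  res[is.na(res)] <- 0
  return(res)
}

# Resolve h2o internals such as .min_max from the package namespace.
environment(.uniformize) <- asNamespace("h2o")
assignInNamespace(".uniformize", .uniformize, "h2o")
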
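
To check what the 32-unique-value threshold does without starting an H2O cluster, the numeric branch can be tried on its own. This is only a sketch: uniformize_numeric and its min_unique argument are hypothetical names for this illustration, not part of the h2o package, and the factor and character branches are left out.

```r
# Hypothetical stand-alone copy of the numeric branch only.
uniformize_numeric <- function(col, min_unique = 32) {
  res <- col
  # Use quantiles only when the column has enough distinct values.
  if (length(unique(col)) >= min_unique) {
    res <- stats::ecdf(col)(col)
  }
  res[is.na(res)] <- 0
  res
}

# A 0/1 column is below the threshold, so it is returned unchanged (NA becomes 0).
binary <- c(0, 1, 1, 0, NA)
binary_out <- uniformize_numeric(binary)
print(binary_out)  # 0 1 1 0 0

# A column with 100 distinct values is still mapped to its quantiles.
continuous <- seq(10, 1000, by = 10)
continuous_out <- uniformize_numeric(continuous)
print(range(continuous_out))  # 0.01 1.00
```

Note that low-cardinality columns are passed through unscaled in this branch, so a column coded as 1/2 rather than 0/1 would keep the values 1 and 2.
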
laura-vangalen commented 1 month ago

Thanks for your quick response. Another hack I found to change how the values are normalized was to add a fake character variable. This variable then got automatically deleted when running the model, but the normalizing still worked in the way I wanted. But thanks for the code, that is much better.