stan-dev / projpred

Projection predictive variable selection
https://mc-stan.org/projpred/

Varsel stat and stratified cross-validation for highly imbalanced data #328

Open adietzel opened 2 years ago

adietzel commented 2 years ago

Hi,

I am working on a model with highly imbalanced data, consisting of few (as few as 30) observations of species presences and a few thousand randomly selected background observations/pseudo-absences. I would like to use projpred to select the most relevant predictor variables from a pool of about 100, but am unsure whether any of the currently implemented varsel statistics deal well with severe class imbalance. The Brier (Skill) Score and Precision Recall AUC could be useful evaluation statistics and were suggested by Aki Vehtari in issue #25 but, it seems, have not been implemented.

Secondly, since I would like to cross-validate my variable selection, I was wondering whether it is possible to somehow stratify the cross-validation procedure to ensure comparable class imbalance across folds. In the worst case, some folds will only contain (pseudo-)absences and no presences and hence fail.

Thanks so much for your help. Cheers

Andy

fweber144 commented 2 years ago

The Brier (Skill) Score and Precision Recall AUC could be useful evaluation statistics and were suggested by Aki Vehtari in issue #25 but, it seems, have not been implemented.

As stated in #25, AUC has been implemented by #27, and the Brier score is effectively the MSE (see this comment). I'm currently not sure whether the Brier score (MSE) is actually available for the binomial family, i.e., whether an error is thrown when trying to select it. If you experience such an error, please report it with a reproducible example and then it should be easy to fix (in that case, I'll re-open this issue).
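
To illustrate that equivalence, here is a minimal sketch (simulated data, not projpred code): for a binary response y and predicted probabilities p, the Brier score is simply the mean squared error on the response scale.

# Minimal sketch (simulated data): Brier score = MSE for a 0/1 response.
set.seed(1)
y <- rbinom(50, size = 1, prob = 0.2)  # imbalanced binary outcome
p <- plogis(-1.5 + rnorm(50))          # some predicted probabilities
( brier <- mean((y - p)^2) )           # identical to the MSE on the response scale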

Secondly, since I would like to cross-validate my variable selection, I was wondering whether it is possible to somehow stratify the cross-validation procedure to ensure comparable class imbalance across folds.

Yes, this is possible via argument cvfits of init_refmodel(). Perhaps loo::kfold_split_grouped() and related functions (see ?loo::kfold_split_grouped) help in constructing the folds. See e.g. https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/tests/testthat/test_varsel.R#L757-L760 or https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/tests/testthat/test_varsel.R#L863-L866
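
For stratification by a binary response specifically, a minimal sketch could look as follows (assuming a data frame mydat with binary response y; loo::kfold_split_stratified() assigns observations to folds so that both classes are spread roughly evenly). The fold-wise refits would then be passed to init_refmodel() via cvfits, as in the test-file lines linked above.

# Minimal sketch (hypothetical data frame `mydat` with binary response `y`):
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)
table(mydat$y, myfolds)  # check that both classes occur in every fold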

adietzel commented 2 years ago

Thanks a lot for your answer, Frank. This helps a lot.

And apologies, I should have been more explicit in my question or problem statement. I am interested in an evaluation metric that focuses on predicting the minority class well and is less sensitive to being swamped by good predictions of the majority class. Is the implemented AUC the AUROC (Receiver operating characteristic) which assumes that both classes (minority and majority) are important, or the AUPRC (Area under the Precision-Recall Curve), which focuses on correctly classifying the minority class?

Or asked differently, which of the currently implemented selection statistics would you recommend for highly imbalanced data, when I am mostly interested in correctly classifying the minority class?

Thanks a lot for your help. Best, Andreas

fweber144 commented 2 years ago

As far as I understand the source code in lines https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/R/misc.R#L29-L54 and lines https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/R/summary_funs.R#L232-L248 as well as your explanations, I think it is the former of the two definitions, i.e.

the AUROC (Receiver operating characteristic) which assumes that both classes (minority and majority) are important

If it helps: projpred's internal auc() implementation (projpred:::auc()) seems to give the same results as pROC::auc():

set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)
dat <- data.frame(y = y, mu = mu)

( auc_val_projpred <- projpred:::auc(cbind(y, mu, 1)) )
library(pROC)
auc_val <- auc(y ~ mu, data = dat, direction = "<", algorithm = 1)
( auc_val_pROC <- auc_val[seq_along(auc_val)] ) # Indexing only to drop attributes.
stopifnot(isTRUE(all.equal(auc_val_pROC, auc_val_projpred, tolerance = 1e-20)))

(On that occasion, I noticed that the comment in line https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/R/misc.R#L39 should read # false positive weights instead of # true negative weights, and that I need to check projpred:::auc() in case of a binomial family with > 1 trials.)

adietzel commented 2 years ago

Thanks so much for looking further into this. The AUC implemented in projpred does indeed seem to be the AUROC.

Implementing additional evaluation metrics is probably not one of your current priorities in further developing projpred, but for cases with highly imbalanced data it would be wonderful to be able, in the future, to select metrics like the AUPRC or the F1 score, which both focus on correctly predicting the minority class and are recommended for highly imbalanced data (see, for instance, Saito & Rehmsmeier 2015, doi: 10.1371/journal.pone.0118432).
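
For reference, here is an illustrative sketch of the F1 score computed from hard 0/1 predictions (not projpred code; in practice the predicted probabilities would first have to be thresholded):

# Illustrative sketch: precision, recall, and F1 from hard 0/1 predictions.
f1_score <- function(y, y_hat) {
  tp <- sum(y == 1 & y_hat == 1)  # true positives
  fp <- sum(y == 0 & y_hat == 1)  # false positives
  fn <- sum(y == 1 & y_hat == 0)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
f1_score(y = rbinom(100, 1, 0.1), y_hat = rbinom(100, 1, 0.1))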

I am not sure how often you and other users of projpred come across cases with highly imbalanced data and how difficult it would be to implement additional metrics in projpred.

Thank you for your help and understanding, and for developing this great project.

Best

Andreas

fweber144 commented 2 years ago

Implementing additional evaluation metrics is probably not one of your current priorities in further developing projpred, but for cases with highly imbalanced data it would be wonderful to be able, in the future, to select metrics like the AUPRC or the F1 score, which both focus on correctly predicting the minority class and are recommended for highly imbalanced data (see, for instance, Saito & Rehmsmeier 2015, doi: 10.1371/journal.pone.0118432).

Yes, thank you for these helpful suggestions. I'm re-opening this issue as a feature request for these additional statistics.

adietzel commented 2 years ago

Excellent. Thanks!

adietzel commented 2 years ago

A short update on this feature request: I have tried to write a new function to calculate the true skill statistic, which is also known as the Peirce score or Hanssen-Kuipers discriminant and copes well with highly imbalanced data. The basic formula is:

tss = tpr + tnr - 1

where tpr is the true positive rate and tnr is the true negative rate.

Here is a short code snippet that calculates the tss for simulated data:

# Simulate data
data <- data.frame(resp = rbinom(100, 1, 0.6),
                   pred = rbinom(100, 1, 0.6),
                   weights = rep(1, 100))

# Calculate true positive rate
data$tp <- data$resp * data$pred
tpr <- sum(data$tp) / sum(data$resp) # tpr = tp / (tp + fn)
tpr

# Calculate true negative rate
data$tn <- ifelse(data$resp == 0 & data$pred == 0, 1, 0)
tnr <- sum(data$tn) / sum(data$resp == 0) # tnr = tn / (fp + tn)
tnr

# Calculate True-Skill Statistic (Peirce score)
tss <- tpr + tnr - 1
tss

I tried to implement this function following the logic of projpred:::auc(), but I have struggled to wrap my head around how the weights are implemented and, specifically, how the cumulative sums of weights are used to calculate the different error rates. The auc() function already calculates the tpr and fpr. If the tnr and fnr were added, a whole host of different scores could be calculated (https://en.wikipedia.org/wiki/Sensitivity_and_specificity).
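
In case it helps the discussion, here is the weighted analogue of the snippet above (simulated data; this is only meant to show how observation weights could enter the tpr/tnr directly, not how projpred:::auc() handles weights internally):

# Weighted analogue of the snippet above (simulated data):
set.seed(1)
y     <- rbinom(100, 1, 0.1)  # response
y_hat <- rbinom(100, 1, 0.1)  # hard 0/1 predictions
w     <- rep(1, 100)          # observation weights
tpr_w <- sum(w * (y == 1 & y_hat == 1)) / sum(w * (y == 1))
tnr_w <- sum(w * (y == 0 & y_hat == 0)) / sum(w * (y == 0))
( tss_w <- tpr_w + tnr_w - 1 )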

Any help would be appreciated. I will have another look at the auc() code next week. Cheers

Andreas

fweber144 commented 2 years ago

tss = tpr + tnr - 1

Interesting, I know this as Youden's Index (also seems to be called Youden's J statistic).

I have struggled to wrap my head around how the weights are implemented and, specifically, how the cumulative sums of weights are used to calculate the different error rates.

Indeed, this is not straightforward to see. I recommend debugging projpred:::auc() line by line using a simple example, e.g., the one from #329, slightly modified:

set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
y <- rbinom(nobs, size = 1, prob = mu)

auc_val_projpred <- projpred:::auc(cbind(y, mu, 1))

The auc() function already calculates the tpr and fpr. If the tnr and fnr were added, a whole host of different scores could be calculated (https://en.wikipedia.org/wiki/Sensitivity_and_specificity).

Keep in mind that projpred:::auc() does not calculate the TPR or FPR, but instead one value per "classification rule", where "classification rule" means to classify all observations as "outcome = 1" for which the predicted probability for "outcome = 1" exceeds (or is equal to) a specific value from mu. (For the AUC, the classification rule is varied by varying a certain threshold, here in terms of mu.) So I'm not sure if this is what you expect/need.
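
To make the "one value per classification rule" idea concrete, here is a conceptual sketch (not projpred's implementation), reusing the y and mu simulated in the snippet above: every unique value of mu acts as a threshold, each threshold yields one (FPR, TPR) point, and the AUC is the area under the curve through these points:

# Conceptual sketch (not projpred's implementation), reusing y and mu from above:
thresholds <- sort(unique(mu), decreasing = TRUE)
tpr <- sapply(thresholds, function(thres) mean(mu[y == 1] >= thres))
fpr <- sapply(thresholds, function(thres) mean(mu[y == 0] >= thres))
# Trapezoidal integration over the (FPR, TPR) points, with the endpoints added:
fpr_all <- c(0, fpr, 1)
tpr_all <- c(0, tpr, 1)
( auc_manual <- sum(diff(fpr_all) * (head(tpr_all, -1) + tail(tpr_all, -1)) / 2) )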

adietzel commented 1 year ago

Thanks a lot for the suggestion, Frank. I haven't had the chance to look into this yet; I have been busy trying to get cv_varsel() to work with stratified K-fold cross-validation and a binomial response variable, as discussed above. I believe this is worth sharing in case other users run into the same problem, and this issue seems the appropriate place given its title. I am using stratified (by y) CV to ensure the few presences in my data are distributed equally across folds.

TLDR: bernoulli() is rejected by init_refmodel(), binomial() with trials(1) fails because the function trials() cannot be found, and the only combination that currently works seems to be binomial() without trials(1).

Here is a reproducible example, following the reprex you (Frank) used in https://github.com/stan-dev/projpred/issues/160, but tweaked slightly to use a binary response and stratified K-fold CV.

Using a Bernoulli distribution would be the most appropriate choice given that I am using presence-absence data. init_refmodel() unfortunately throws an error saying that bernoulli is not supported by projpred (version 2.2.1), which is surprising since it is meant to be supported. Perhaps init_refmodel() has not been updated yet?

library(brms)
library(projpred)
data("df_gaussian")
mydat <- break_up_matrix_term(y ~ x, data = df_gaussian)$data
mydat <- head(mydat, 40)

mydat$y <- ifelse(mydat$y > 0, 1, 0)

myfit <- brm(
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  data = mydat,
  family = bernoulli(),
  seed = 4981218
)

myformula <- formula(formula(myfit))
myextract_model_data <- function(object, newdata = NULL,
                                 wrhs = NULL, orhs = NULL, extract_y = TRUE) {
  resp_form <- if (extract_y) projpred:::lhs(myformula) else NULL
  args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
  return(projpred:::do_call(projpred:::.extract_model_data, args))
}
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)

myfits_K <- kfold(myfit, folds = myfolds, save_fits = TRUE)
myrefmod <- init_refmodel(
  object = myfit,
  data = mydat,
  formula = myformula,
  family = myfit$family,
  extract_model_data = myextract_model_data,
  # dis = as.matrix(myfit)[, "sigma"], # original
  dis = rep(0, nrow(as.matrix(myfit))), # workaround since setting dis = NULL throws an error
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = myfolds)
)

If I instead use a binomial family, the appropriate formula specification in brms would be "y | trials(1) ~ ... ". Skipping trials(1) is deprecated, throws a warning, and might break in future versions.

Using trials(1), however, makes init_refmodel() throw the error: "Error in trials(1) : could not find function "trials""

myfit <- brm(
  formula = y | trials(1) ~ x1 + x2 + x3 + x4 + x5,
  data = mydat,
  family = binomial(),
  seed = 4981218
)

myformula <- formula(formula(myfit))
myextract_model_data <- function(object, newdata = NULL,
                                 wrhs = NULL, orhs = NULL, extract_y = TRUE) {
  resp_form <- if (extract_y) projpred:::lhs(myformula) else NULL
  args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
  return(projpred:::do_call(projpred:::.extract_model_data, args))
}
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)

myfits_K <- kfold(myfit, folds = myfolds, save_fits = TRUE)
myrefmod <- init_refmodel(
  object = myfit,
  data = mydat,
  formula = myformula,
  family = myfit$family,
  extract_model_data = myextract_model_data,
  # dis = as.matrix(myfit)[, "sigma"], # original
  dis = rep(0, nrow(as.matrix(myfit))), # workaround since setting dis = NULL throws an error
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = myfolds)
)

Currently, the only way to make this approach work seems to be omitting trials(1) and living with the warning from brms:

myfit <- brm(
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  data = mydat,
  family = binomial(),
  seed = 4981218
)

myformula <- formula(formula(myfit))
myextract_model_data <- function(object, newdata = NULL,
                                 wrhs = NULL, orhs = NULL, extract_y = TRUE) {
  resp_form <- if (extract_y) projpred:::lhs(myformula) else NULL
  args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
  return(projpred:::do_call(projpred:::.extract_model_data, args))
}
myK <- 5
myfolds <- loo::kfold_split_stratified(K = myK, x = mydat$y)
table(mydat$y, myfolds)
myfits_K <- kfold(myfit, folds = myfolds, save_fits = TRUE)
myrefmod <- init_refmodel(
  object = myfit,
  data = mydat,
  formula = myformula,
  family = binomial(),
  extract_model_data = myextract_model_data,
  # dis = as.matrix(myfit)[, "sigma"], # original
  dis = rep(0, nrow(as.matrix(myfit))), # workaround since setting dis = NULL throws an error
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = myfolds)
)
mycvvs <- cv_varsel(
  myrefmod,
  method = "forward",
  cv_method = "kfold",
  K = myK,
  nclusters = 2,
  nclusters_pred = 3,
  nterms_max = 4,
  seed = 86132035
)

Just thought I would make you aware of this "issue" and provide a working example for others interested in using stratified CV with a binomial classifier.

Cheers

adietzel commented 1 year ago

As Frank pointed out in #352, the bernoulli() distribution can be handled by using brms:::get_refmodel.brmsfit() instead of init_refmodel().

fweber144 commented 1 year ago

As Frank pointed out in https://github.com/stan-dev/projpred/issues/352, the bernoulli() distribution can be handled by using brms:::get_refmodel.brmsfit() instead of init_refmodel().

Yes, thank you for updating the issue here, too. And for those stumbling across the issue here, I want to point out that brms:::get_refmodel.brmsfit()'s approach is to convert the brms::bernoulli() family to a binomial() one, so that approach could also be used for a custom reference model created by init_refmodel() if the underlying fit is a brms::bernoulli() one.
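
A minimal sketch (assuming myfit is the brms::bernoulli() fit from the example above; whether fold-wise refits can be supplied to this method directly depends on the projpred and brms versions, so the cvfits setup from above may still be needed):

# Sketch: let the brms method build the reference model; internally it treats
# the brms::bernoulli() family as a binomial() one.
myrefmod <- projpred::get_refmodel(myfit)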

binomial() without specifying trials(1) is deprecated in brms, but including trials(1) throws an error in init_refmodel() stating that the function trials() could not be found

Yes, this is related to the fact that brms:::get_refmodel.brmsfit() (or the approach used there) needs to be used.

I am using stratified (by y) CV to ensure the few presences in my data are distributed equally across folds.

Stratifying K-fold CV by the response variable seems a bit odd to me; I think stratification is usually meant to be done by a predictor variable. But I haven't thought this through yet, and it might be a valid procedure. In any case, this is a question related to the loo package, not to projpred.