sem-in-r / seminr

Natural feeling domain-specific language for building structural equation models in R for estimation by covariance-based methods (like LISREL/Lavaan) or partial least squares (like SmartPLS)

bootstrap_model errors with near-zero-variance binary variables #339

Open Gwstat opened 9 months ago

Gwstat commented 9 months ago

Hi everyone,

I want to use a partial least squares path model to calculate good weight guesses for some pre-defined composite indices. Each of these indices consists of a set of binary (0/1-coded) variables. Some of these items have very low variance because the probability of a 1 is less than 1%. I still expect these items to have significant weights, so I don't want to drop them from the analysis entirely.

Everything works fine when I use seminr::estimate_pls to estimate the initial model. However, I also want to calculate the confidence intervals via bootstrap (to decide which of the items are not significant). Unfortunately, I encountered several different error messages when I tried seminr::bootstrap_model ("singular matrix", "zero variance items cannot be scaled", etc.). I figured out this is due to "inconvenient" resamples in which some indicators are constant 0 (which becomes more likely the lower the probability of a 1 is) and some unlucky resamples in which some indicators are perfectly correlated. Here is a reproducible example:

set.seed(12224232)
n <- 5000

# ten binary indicators; i4, i5, e3, and e5 are rare (very low share of 1s)
df <- data.frame(
  i1 = sample(c(0, 1), n, replace = TRUE, prob = c(0.5, 0.5)),
  i2 = sample(c(0, 1), n, replace = TRUE, prob = c(0.3, 0.7)),
  i3 = sample(c(0, 1), n, replace = TRUE, prob = c(0.4, 0.6)),
  i4 = sample(c(0, 1), n, replace = TRUE, prob = c(0.998, 0.002)),
  i5 = sample(c(0, 1), n, replace = TRUE, prob = c(0.98, 0.02)),
  e1 = sample(c(0, 1), n, replace = TRUE, prob = c(0.1, 0.9)),
  e2 = sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8)),
  e3 = sample(c(0, 1), n, replace = TRUE, prob = c(0.95, 0.05)),
  e4 = sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)),
  e5 = sample(c(0, 1), n, replace = TRUE, prob = c(0.998, 0.002))
) |>
  # outcome: a linear combination of all indicators plus noise
  dplyr::mutate(target = rnorm(dplyr::n(), 2, 1) +
                  i1*0.1 + i2*0.1 + i3*0.08 + i4*0.19 + i5*0.25 +
                  e1*0.1 + e2*0.1 + e3*0.08 + e4*0.19 + e5*0.25)

df |> colSums()
# as you can see: some variables have fewer than 20 observed 1s

# structural model: both composite indices predict the target
structure <- seminr::relationships(
  seminr::paths(from = c("index1", "index2"), to = "target")
)

# measurement model: two mode-B composites plus a single-item reflective target
measurements <- seminr::constructs(
  seminr::composite("index1",
                    seminr::multi_items("i", 1:5),
                    weights = seminr::mode_B),
  seminr::composite("index2",
                    seminr::multi_items("e", 1:5),
                    weights = seminr::mode_B),
  seminr::reflective("target", "target")
)

pls_model <- seminr::estimate_pls(
  data = df,
  measurement_model = measurements,
  structural_model = structure
)

# error here:
boot_model <- seminr::bootstrap_model(pls_model, nboot = 2000)

As you can see, bootstrap_model runs for a few seconds and then stops when it hits the "zero variance" error. Unfortunately, the algorithm seems to discard all the "valid" resamples as well. In more complex models this is very frustrating, since the function runs for a few minutes and then eventually fails.

Is it possible to adjust the function so that it does not stop but simply skips to the next iteration when it encounters the zero-variance issue? Did I miss an argument that avoids this? Or does somebody have an idea for a workaround?
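
For what it's worth, here is a rough sketch of the kind of manual workaround I have in mind: re-estimating the model on plain resamples and skipping the ones that error out via tryCatch. The number of draws (n_boot) and the extraction of the weights via $outer_weights are just my assumptions about the estimate_pls object, so please correct me if that element is named differently.

# manual bootstrap: skip resamples where estimation errors out
n_boot <- 2000
boot_weights <- vector("list", n_boot)

for (b in seq_len(n_boot)) {
  resample <- df[sample(nrow(df), replace = TRUE), ]
  fit <- tryCatch(
    seminr::estimate_pls(
      data = resample,
      measurement_model = measurements,
      structural_model = structure
    ),
    error = function(e) NULL  # "zero variance" / singular resamples end up here
  )
  if (!is.null(fit)) boot_weights[[b]] <- fit$outer_weights  # assumed element name
}

# keep only the resamples that estimated successfully
boot_weights <- boot_weights[!vapply(boot_weights, is.null, logical(1))]
length(boot_weights)  # number of usable resamples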

Thank you in advance for your help!

soumyaray commented 9 months ago

Hi @Gwstat, thanks for reporting this! This is a philosophical conundrum we have faced before (e.g., with consistent PLS, where erratic iterations are almost the norm). It raises the question of whether the bootstrap procedure is even a reliable source of information if the data at hand do not have enough variance to sustain a resampling process. I'll discuss with @NicholasDanks whether this is something we should "fix" or something that authors should resolve in their own data. While it's tempting to make a quick fix (as with so many issues), it would lead to many people publishing work that cannot truly be replicated or might not even be valid.

While we discuss this, I would suggest brainstorming some workarounds. For example, if there are zero-variance columns (binary or otherwise), it suggests an imbalance in the data. There are oversampling/undersampling methods (keeping the total n the same) that might help create a more balanced dataset. I realize this is more common in predictive analytics and more unusual for inference, but perhaps it has been discussed somewhere? The closest I can find is stratified bootstrapping.
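
To make the stratified idea concrete, here is a minimal sketch (not part of seminr) that resamples within the strata of a single rare indicator, here i4 from the example above, so every bootstrap sample keeps the original count of 1s for that column. With several rare binaries you would have to stratify on their combination, which gets thin quickly, so treat this only as an illustration of the direction.

# stratified resample: draw separately within each level of one rare indicator
stratified_resample <- function(data, stratum) {
  pieces <- split(data, data[[stratum]])
  resampled <- lapply(pieces, function(s) s[sample(nrow(s), replace = TRUE), ])
  do.call(rbind, resampled)
}

resample <- stratified_resample(df, "i4")
table(resample$i4)  # the rare 1s are guaranteed to reappear in every resample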