as per Li and Redden (2014), using the standard sandwich variance-covariance matrix to perform a Wald test of significance - as we do in waldtestGEE() - leads to inflated Type I error (false positive) rates when the number of subjects is small and / or the number of timepoints per-subject is large
this was seen in our simulation study, where we observed high numbers of false positives when using $n_s = 6$ subjects
scRNA-seq datasets quite often have low numbers of subjects compared to observations per-subject, making this a problem for the GEE mode
several small-sample bias corrections exist, the easiest of which to implement is the DF correction, where $p$ is the number of covariates:
thus, it makes sense to add an option to bias-correct the estimated sandwich variance-covariance matrix when the number of subjects is lower than some heuristic threshold e.g., $n_s = 30$
in addition to the naive DF correction mentioned above, implement the Kauermann and Carroll correction which should work better (i hope)
this is much more complicated, as it involves recreating the subject-level correlation ($\mathbf{R}_i$) and vcov ($\mathbf{V}_i$) matrices as well as the projection matrix $\mathbf{H}$
probably need to built out bias correction into its own function
call it biasCorrectGEE() and do not export to users
arguments: fitted model object, data to fit model, assumed correlation structure (AR-1 or exchangeable), and variable specifying subject ID
waldtestGEE()
- leads to inflated Type I error (false positive) rates when the number of subjects is small and / or the number of timepoints per-subject is large$$ \mathbf{V}_{df} = \left(\frac{n_s}{n_s-p}\right) \mathbf{V} $$