benthestatistician opened 10 years ago
Regarding the linear algebra: this seems related to situations in which the Wishart distribution fails to have a density (which happens when the degrees of freedom fall below the dimension). I gather this from reading Muirhead, Aspects of Multivariate Statistical Theory. Muirhead refers in turn to Eaton and Perlman (1973), "The non-singularity of generalized sample covariance matrices," Annals of Statistics, 1, 710-717, and to Das Gupta (1971), "Non-singularity of the sample covariance matrix," Sankhya A, 33, 475-478.
We don't need to finish sorting through the linear algebra issues before work gets going on this; the two projects could proceed in parallel, or asynchronously.
Since the xBalance chisquare statistic is interpretable as a squared Mahalanobis distance of imbalances from the origin, let's call it the joint norm of imbalances. (Better yet would be to associate that name with the square root of the chi-square statistic divided by its d.f.; that distinction is neither here nor there for current purposes.)
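In symbols (a restatement of the above, with notation introduced here): writing d for the vector of adjusted differences and S for their covariance under the null randomization distribution, the chisquare statistic is d' S^- d, with S^- a generalized inverse, and the suggested joint norm would be \sqrt{d' S^- d / d.f.}.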
I created a new branch, issue25.
In the function alignToInferentials from Design.R I've replaced the calculation of csq with a method that uses the Q from a QR decomposition of tmat (see the sketch after this comment).
I plugged in a placeholder for the DF output. This should be changed to the correct DF; could someone take a look and tell me what it should be?
I haven't done much testing yet. It works and passes the existing tests, but I'll add more.
We also need to compare the performance of this method with the old method.
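For anyone following along, here is a minimal sketch of what a QR-based computation of this kind might look like. It assumes the statistic takes the form d' S^- d with d = X'w and S = crossprod(X), in which case the quadratic form reduces to the squared norm of w projected onto the column space of X. The names qr_chisq, X, and w are illustrative, not the actual Design.R internals:

qr_chisq <- function(X, w) {
  QR <- qr(X)                                      # pivoted QR of the (scaled) covariate matrix
  Q <- qr.Q(QR)[, seq_len(QR$rank), drop = FALSE]  # first rank-many Q columns span col(X)
  rotated <- crossprod(Q, w)                       # coordinates of w in col(X)
  list(csq = sum(rotated^2),                       # the quadratic-form statistic
       DF = QR$rank)                               # cf. the DF discussion below
}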
Re DF output: for now, the DF should be the integer we find in the rank slot of the QR output, i.e. qr(tmat.scaled)$rank.
Eventually, as part of the fix for the underlying bug behind #25, we'll need to do some auxiliary analysis of this QR object in order to decide whether to drop one or more of the rightmost Q columns. When we do that, rotated.X will of course have to be thinned down to exclude the dropped columns, and the DF will be the rank of the QR object minus however many Q columns have been excluded. These are next steps, however.
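To illustrate that bookkeeping (continuing the hypothetical qr_chisq sketch above; drop_q_columns and ndrop are invented names, and the rule for choosing ndrop is exactly the open question here):

drop_q_columns <- function(QR, rotated, ndrop) {
  keep <- seq_len(QR$rank - ndrop)          # indices of the retained Q columns
  rotated <- rotated[keep, , drop = FALSE]  # thin the rotated coordinates accordingly
  list(csq = sum(rotated^2),                # recomputed statistic
       DF = QR$rank - ndrop)                # rank minus the excluded Q columns
}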
I updated the DF output according to your comment.
Examples of the problem. By construction, x1 is out of balance; the other x's are noise.
genDat <- function(n, k) {
  dat <- data.frame(matrix(rnorm(n * k), n, k))
  names(dat)[1L:k] <- paste0('x', 1L:k)
  dat <- data.frame(z = numeric(n), dat)
  while (var(dat$z) == 0)  # avoid breaking sims b/c z is all 1 or all 0
    dat$z <- as.numeric(dat[['x1']] + rnorm(n) > 0)
  dat
}
Now for the xBal part:
getpvals <- function(n, k, N) {
  pvs <- replicate(N, {
    dat <- genDat(n, k)
    xBalance(formula(dat), data = dat, report = "chisq")$overall[, 'p.value']
  })
  pvs
}
... and finally some examples.
> pvs <- getpvals(5,1,100)
> summary(pvs) # conservative, should have rejected ~5 times not none
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.05122 0.10780 0.18080 0.29280 0.42840 0.98620
> pvs <- getpvals(5,2,100)
> summary(pvs) # more conservative
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1398 0.2017 0.2694 0.3558 0.4451 0.9912
> pvs <- getpvals(5,3,100)
> summary(pvs) # bad but not awful
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2615 0.2896 0.3495 0.4173 0.4802 0.9922
> pvs <- getpvals(5,4,100)
> summary(pvs) # degenerate
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.406 0.406 0.406 0.406 0.406 0.406
This methodological paper of Young (2016) is relevant.
Conjecture: if, in the combined difference statistic, you standardize by the generalized inverse of any nonnegative definite matrix other than the covariance of the adjusted differences (relative to the null randomization distribution), the degeneracy problem will disappear. The distribution of this statistic won't be chi-square any more, even for favorable (n, p) combinations; instead the natural approximation for it will be a weighted sum of chi-squares. This approximation will be similarly non-robust to largish p, but that problem can be kicked down the road a little.
The CompQuadForm package provides quantiles, etc., for weighted sums of chi-squares.
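For instance, CompQuadForm's imhof() gives the upper-tail probability of a weighted sum of chi-squares. The weights lambda and the observed statistic qstat below are made-up numbers, purely for illustration:

library(CompQuadForm)
lambda <- c(2.5, 1.0, 0.3)  # hypothetical weights, e.g. eigenvalues of the standardizing matrix
qstat <- 4.2                # hypothetical observed quadratic-form statistic
imhof(qstat, lambda)$Qq     # P(Q > qstat), Q = sum of lambda[i] * chi-square(1) variates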
As a hack, could we return the maxT test statistic results from independence_test (for non-cluster situations)? I'm not 100% sure that the results below are super-favorable for the maxT test, but at least they're not degenerate. (Including @benthestatistician's results from above for comparison.)
R version 4.1.1 (2021-08-10) -- "Kick Things"
Platform: x86_64-apple-darwin17.0 (64-bit)
> library(coin)
Loading required package: survival
> library(RItools)
Loading required package: SparseM
Attaching package: ‘SparseM’
The following object is masked from ‘package:base’:
backsolve
>
> genDat <- function(n, k) {
+ dat <-
+ data.frame(matrix(
+ rnorm(n * k),
+ n, k
+ ))
+ names(dat)[1L:k] <- paste0("x", 1L:k)
+ dat <- data.frame(z = numeric(n), dat)
+ while (var(dat$z) == 0) { # avoid breaking sims b/c z is all 1 or all 0
+ dat$z <- as.numeric(dat[["x1"]] + rnorm(n) > 0)
+ }
+ dat
+ }
>
> getpvals <- function(n, k, N) {
+ pvs <- replicate(N, {
+ dat <- genDat(n, k)
+ xbp <- xBalance(formula(dat), data = dat, report = "chisq")$overall[, "p.value"]
+ dat$zF <- factor(dat$z)
+ coin_fmla <- as.formula(paste(paste(names(dat)[-1], collapse = "+"), "~zF"))
+ coinp <- pvalue(independence_test(coin_fmla, data = dat))
+ c(xbp, coinp)
+ })
+ pvs
+ }
>
> pvs <- getpvals(5, 1, 100)
> apply(pvs, 1, summary)
[,1] [,2]
Min. 0.04876140 0.05285054
1st Qu. 0.09452078 0.06913298
Median 0.16397380 0.07644819
Mean 0.27407699 0.07581206
3rd Qu. 0.38178422 0.08435319
Max. 0.90999496 0.08885573
> ## > summary(pvs) # conservative, should have rejected ~5 times not none
> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
> ## 0.05122 0.10780 0.18080 0.29280 0.42840 0.98620
> pvs <- getpvals(5, 2, 100)
> apply(pvs, 1, summary)
[,1] [,2]
Min. 0.1380421 0.07898181
1st Qu. 0.1830466 0.10070713
Median 0.2595936 0.10824416
Mean 0.3449585 0.10775444
3rd Qu. 0.4689084 0.11610262
Max. 0.9948093 0.12805490
> ## > summary(pvs) # more conservative
> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
> ## 0.1398 0.2017 0.2694 0.3558 0.4451 0.9912
> pvs <- getpvals(5, 3, 100)
> apply(pvs, 1, summary)
[,1] [,2]
Min. 0.2614649 0.09853688
1st Qu. 0.2717162 0.12544456
Median 0.3174379 0.13474620
Mean 0.3743291 0.13414189
3rd Qu. 0.4302293 0.14387579
Max. 0.9702020 0.15716720
> ## > summary(pvs) # bad but not awful
> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
> ## 0.2615 0.2896 0.3495 0.4173 0.4802 0.9922
> pvs <- getpvals(5, 4, 100)
> apply(pvs, 1, summary)
[,1] [,2]
Min. 0.4060058 0.1146464
1st Qu. 0.4060058 0.1498375
Median 0.4060058 0.1573900
Mean 0.4060058 0.1559197
3rd Qu. 0.4060058 0.1652102
Max. 0.4060058 0.1841966
> ## > summary(pvs) # degenerate
> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
> ## 0.406 0.406 0.406 0.406 0.406 0.406
> proc.time()
user system elapsed
6.159 0.179 6.466
Recording another idea here: once we can report that we've detected a situation ripe for degenerate p-values, we could ask whether the analyst would prefer to use a reduced number of singular values (say no more than 10, or min(\sqrt{n}, \sqrt{clusters}), or something). This would still let us produce omnibus tests. I recall that Mark may already have had an idea along these lines a while back; see the sketch below.
Also recording the code from above, confirming that coin's quadratic test statistic has the same issues as ours.
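A hypothetical sketch of that truncation idea (svd_chisq and maxcomp are invented names, and the assumed form of the statistic matches the QR sketch earlier in the thread; this is not existing RItools code):

svd_chisq <- function(X, w, maxcomp = 10) {
  s <- svd(X)
  tol <- sqrt(.Machine$double.eps) * s$d[1]      # drop numerically zero singular values
  keep <- seq_len(min(maxcomp, sum(s$d > tol)))  # cap the number of components retained
  U <- s$u[, keep, drop = FALSE]                 # leading left singular vectors
  list(csq = sum(crossprod(U, w)^2),             # quadratic form on the retained components
       DF = length(keep))
}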
Adam Sales brought to my attention some examples in which the reported chisquare statistic and degrees of freedom were precisely equal to one another. There was a one-stratum case and a matched case, both with p > n. It turned out that in both cases the actual permutation distribution of the chisquare statistic was degenerate: reshuffling the treatment variable changed the variable-wise z-stats, adjusted differences, etc., but you always got the same chisquare and d.f. (This is consistent with the constant p-value in the simulations above: pchisq(4, 4, lower.tail = FALSE) = 0.4060058, i.e. a statistic apparently equal to its d.f. of 4.)
These cases should be detected and flagged; perhaps the p-value should be returned as NA. But the first task is to specify the situations in which the problem arises. I suspect there's a simple answer based on linear algebra, but I'm not certain of that.
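In the meantime, one hypothetical way to detect such cases empirically, directly mirroring the description above (looks_degenerate and statfun are invented names; statfun stands in for whatever computes the chisquare statistic from a treatment vector and covariate matrix):

looks_degenerate <- function(z, X, statfun, B = 20) {
  observed <- statfun(z, X)                         # statistic at the observed assignment
  reshuffled <- replicate(B, statfun(sample(z), X)) # statistic under random reshuffles of z
  all(abs(reshuffled - observed) < sqrt(.Machine$double.eps))  # TRUE if it never moves
}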