markmfredrickson / RItools

Randomization inference tools for R
GNU General Public License v2.0

`xBalance` should identify and mark situations where chisquare stat has degenerate distribution #25

Open benthestatistician opened 10 years ago

benthestatistician commented 10 years ago

Adam Sales brought to my attention some examples in which the chisquare statistic and the degrees of freedom were reported as precisely the same number. There was a one-stratum case and a matched case, both with p > n. It turned out that in both cases the actual permutation dist'n of the chisquare statistic was degenerate: reshuffling the treatment variable changed the variable-wise z-stats, adjusted diffs, etc., but you always got the same chisquare and d.f.

These cases should be marked and identified; perhaps the p-value should be returned as NA. But the first task is to specify the situations in which the problem arises. I suspect there's a simple answer based on linear algebra, but I'm not certain of that.

benthestatistician commented 10 years ago

Regarding the linear algebra: Seems related to situations in which the Wishart distribution fails to have a density. I get this after reading in Muirhead, Aspects of Multivariate Statistical Theory. Muirhead refers in turn to Eaton and Perlman, 1973, "The non-singularity of generalized sample covariance matrices,'' Annals of Statistics, 1, 710-717, and to Das Gupta, 1971, "Non-singularity of the sample covariance matrix," Sankhya A, 33, 475--78.

benthestatistician commented 10 years ago

We don't need to finish sorting through the linear algebra issues for work to get going on this; the two projects could be undertaken in parallel, or asynchronously.

Since the xBalance chisquare statistic is interpretable as a squared Mahalanobis distance of imbalances from the origin, let's call it the joint norm of imbalances. (Better yet would be to associate that name with the square root of the chi-square statistic divided by its d.f.; that distinction is neither here nor there for current purposes.)

  1. The basic issue is that there are cases in which the permutation distribution of the joint norm of imbalances is degenerate.
  2. I'm confident that there's a simple linear algebra based rule for picking those cases out. Coding of the solution could get started even before that rule has been determined, w/ some sort of placeholder criterion put in instead.
  3. The solution, when the joint norm of imbalances has been found to be degenerate, is to zero out just enough of the smallest eigenvalues of the covariance matrix to make it non-degenerate. Then we provide our answer with a warning, something along the lines of "For stratification {strat}, calculating joint norm of imbalances based on first {k} eigenvectors of their permutation covariance, to ensure that the norm has a non-degenerate permutation distribution."
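The eigenvalue-truncation idea in point 3 might be sketched roughly as follows. This is an illustrative helper, not xBalance internals; the function name and tolerance rule are made up for the sketch:

```r
## Sketch of point 3: compute the joint norm of imbalances after
## dropping (numerically) zero eigenvalues of the permutation
## covariance, so the statistic has a non-degenerate permutation
## distribution. Illustrative only -- not xBalance code.
truncatedChisq <- function(d, S, tol = sqrt(.Machine$double.eps)) {
  eig <- eigen(S, symmetric = TRUE)
  keep <- eig$values > tol * max(eig$values)  # eigenvectors retained
  k <- sum(keep)
  ## project the imbalances onto the leading k eigenvectors and
  ## standardize each coordinate by its eigenvalue
  proj <- crossprod(eig$vectors[, keep, drop = FALSE], d)
  csq <- sum(proj^2 / eig$values[keep])
  list(chisquare = csq, df = k, dropped = ncol(S) - k)
}
```

Here `d` stands in for the vector of adjusted differences and `S` for their permutation covariance; `df` would be the retained rank, and `dropped` is what the proposed warning message would report.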
nullsatz commented 8 years ago

I created a new branch, issue25.

In the function alignToInferentials from Design.R I've replaced the calculation of csq with a method that uses the Q from a QR decomposition of tmat.

I plugged in a placeholder for the DF output. This should be changed to the correct DF, if someone could take a look and tell me what it should be.

I didn't do too much testing yet. It works and passes existing tests, but I'll add some tests.

We also need to compare the performance of this method with the old method.

benthestatistician commented 8 years ago

Re DF output: For now, the DF should be the integer we find in the rank slot of the QR output, i.e. qr(tmat.scaled)$rank.

Eventually, as part of the fix for the underlying bug behind #25, we'll need to do some auxiliary analysis of this QR object in order to decide whether to drop one or more of the rightmost Q columns. When we do that: rotated.X will of course have to be thinned down to exclude the dropped columns; and the DF will be the rank of the QR object minus however many Q columns have been excluded. These are next steps, however.
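A minimal sketch of the DF rule described above. `tmat.scaled` here is a stand-in for the aligned/scaled covariate matrix used in alignToInferentials, not the Design.R code itself:

```r
## Sketch: DF taken from the rank slot of the QR decomposition, as
## described above. tmat.scaled is a stand-in for the aligned/scaled
## covariate matrix; illustrative only.
set.seed(1)
tmat.scaled <- matrix(rnorm(5 * 4), 5, 4)  # n = 5 units, p = 4 covariates
qr.t <- qr(tmat.scaled)
DF <- qr.t$rank                            # for now, DF = rank of the QR object
## the Q columns actually in use; the eventual fix would drop some of
## the rightmost ones and reduce DF accordingly
Q <- qr.Q(qr.t)[, seq_len(DF), drop = FALSE]
```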


nullsatz commented 8 years ago

I updated DF according to your comment.

benthestatistician commented 8 years ago

Examples of the problem. By construction, x1 is out of balance, other xes are noise.

    genDat <- function(n, k) {
        dat <- data.frame(matrix(rnorm(n * k), n, k))
        names(dat)[1L:k] <- paste0("x", 1L:k)
        dat <- data.frame(z = numeric(n), dat)
        while (var(dat$z) == 0) # avoid breaking sims b/c z is all 1 or all 0
            dat$z <- as.numeric(dat[["x1"]] + rnorm(n) > 0)
        dat
    }

Now for the xBal part:

    getpvals <- function(n, k, N) {
        pvs <- replicate(N, {
            dat <- genDat(n, k)
            xBalance(formula(dat), data = dat, report = "chisq")$overall[, "p.value"]
        })
        pvs
    }

... and finally some examples.

> pvs <- getpvals(5,1,100)
> summary(pvs) # conservative, should have rejected ~5 times not none
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05122 0.10780 0.18080 0.29280 0.42840 0.98620 
> pvs <- getpvals(5,2,100) 
> summary(pvs) # more conservative
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1398  0.2017  0.2694  0.3558  0.4451  0.9912 
> pvs <- getpvals(5,3,100)
> summary(pvs) # bad but not awful
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2615  0.2896  0.3495  0.4173  0.4802  0.9922
> pvs <- getpvals(5,4,100)
> summary(pvs) # degenerate
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.406   0.406   0.406   0.406   0.406   0.406  
benthestatistician commented 8 years ago

This methodological paper (Young, 2016) is relevant.

benthestatistician commented 6 years ago

Conjecture: if, in the combined difference statistic, you standardize by the generalized inverse of any nonnegative definite matrix other than the covariance of adjusted differences (relative to the null randomization distribution), the degeneracy problem will disappear. The distribution of this statistic won't be chi-square any more, even for favorable (n, p) combinations; instead the natural approximation for it will be a weighted sum of chi-squares. This approximation will be similarly non-robust to largish p, but that problem can be kicked down the road a little.

The CompQuadForm package provides quantiles, etc for weighted sums of chi squares.
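For instance, assuming CompQuadForm is installed, the upper-tail probability of a weighted sum of independent chi-square(1) variables can be obtained with its davies() routine; the weights and observed statistic below are purely illustrative:

```r
## Sketch: p-value for a weighted sum of chi-square(1) variables,
## sum_i lambda_i * Z_i^2, via the Davies algorithm. Assumes the
## CompQuadForm package is installed; numbers are illustrative.
library(CompQuadForm)
lambda <- c(2.0, 1.0, 0.5)   # illustrative weights (e.g. eigenvalues)
q.obs <- 4.2                 # illustrative observed statistic
out <- davies(q.obs, lambda = lambda)
out$Qq   # P(sum_i lambda_i * chisq_1 > q.obs), up to numerical error
```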

jwbowers commented 3 years ago

As a hack, could we return the maxT test statistic results from independence_test (for non-cluster situations)? Not 100% sure that these results below are super-favorable for the maxT test, but at least not degenerate. (Including @benthestatistician results from above for comparison.)


R version 4.1.1 (2021-08-10) -- "Kick Things"

> library(coin)
Loading required package: survival
> library(RItools)
Loading required package: SparseM

Attaching package: ‘SparseM’

The following object is masked from ‘package:base’:

    backsolve

> 
> genDat <- function(n, k) {
+   dat <-
+     data.frame(matrix(
+       rnorm(n * k),
+       n, k
+     ))
+   names(dat)[1L:k] <- paste0("x", 1L:k)
+   dat <- data.frame(z = numeric(n), dat)
+   while (var(dat$z) == 0) { # avoid breaking sims b/c z is all 1 or all 0
+     dat$z <- as.numeric(dat[["x1"]] + rnorm(n) > 0)
+   }
+   dat
+ }
> 
> getpvals <- function(n, k, N) {
+   pvs <- replicate(N, {
+     dat <- genDat(n, k)
+     xbp <- xBalance(formula(dat), data = dat, report = "chisq")$overall[, "p.value"]
+     dat$zF <- factor(dat$z)
+     coin_fmla <- as.formula(paste(paste(names(dat)[-1], collapse = "+"), "~zF"))
+     coinp <- pvalue(independence_test(coin_fmla, data = dat))
+     c(xbp, coinp)
+   })
+   pvs
+ }
> 
> pvs <- getpvals(5, 1, 100)
> apply(pvs, 1, summary)
              [,1]       [,2]
Min.    0.04876140 0.05285054
1st Qu. 0.09452078 0.06913298
Median  0.16397380 0.07644819
Mean    0.27407699 0.07581206
3rd Qu. 0.38178422 0.08435319
Max.    0.90999496 0.08885573
> ## > summary(pvs) # conservative, should have rejected ~5 times not none
> ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> ## 0.05122 0.10780 0.18080 0.29280 0.42840 0.98620
> pvs <- getpvals(5, 2, 100)
> apply(pvs, 1, summary)
             [,1]       [,2]
Min.    0.1380421 0.07898181
1st Qu. 0.1830466 0.10070713
Median  0.2595936 0.10824416
Mean    0.3449585 0.10775444
3rd Qu. 0.4689084 0.11610262
Max.    0.9948093 0.12805490
> ## > summary(pvs) # more conservative
> ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> ##  0.1398  0.2017  0.2694  0.3558  0.4451  0.9912
> pvs <- getpvals(5, 3, 100)
> apply(pvs, 1, summary)
             [,1]       [,2]
Min.    0.2614649 0.09853688
1st Qu. 0.2717162 0.12544456
Median  0.3174379 0.13474620
Mean    0.3743291 0.13414189
3rd Qu. 0.4302293 0.14387579
Max.    0.9702020 0.15716720
> ## > summary(pvs) # bad but not awful
> ##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> ## 0.2615  0.2896  0.3495  0.4173  0.4802  0.9922
> pvs <- getpvals(5, 4, 100)
> apply(pvs, 1, summary)
             [,1]      [,2]
Min.    0.4060058 0.1146464
1st Qu. 0.4060058 0.1498375
Median  0.4060058 0.1573900
Mean    0.4060058 0.1559197
3rd Qu. 0.4060058 0.1652102
Max.    0.4060058 0.1841966
> ## > summary(pvs) # degenerate
> ##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> ##  0.406   0.406   0.406   0.406   0.406   0.406
> proc.time()
   user  system elapsed 
  6.159   0.179   6.466 
jwbowers commented 2 years ago

Recording another idea here: once we can report that we've detected a situation ripe for degenerate p-values, we could ask whether the analyst would prefer to use a limited number of singular values (say no more than 10, or min(sqrt(n), sqrt(clusters)), or something like that). This would still permit omnibus tests. I recall that Mark may already have had an idea along these lines a while back.

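One way that cap might look in code. The cap rule min(10, floor(sqrt(n))) is just the rule of thumb floated above, and the function name is made up for the sketch:

```r
## Sketch of the idea above: cap the number of singular values of the
## permutation covariance that get used when standardizing imbalances.
## Illustrative only; the cap rule is a placeholder.
cappedRank <- function(S, n, cap = min(10L, floor(sqrt(n)))) {
  sv <- svd(S, nu = 0, nv = 0)$d
  keep <- sv > sqrt(.Machine$double.eps) * max(sv)  # nonzero singular values
  min(sum(keep), cap)   # use at most `cap` of them
}
```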
Also recording the code from above, confirming that coin's quadratic test statistic has the same issues as ours.

chisq_exploration.txt