markmfredrickson / RItools

Randomization inference tools for R
GNU General Public License v2.0
16 stars 11 forks source link

confirm appropriate weights processing #89

Closed benthestatistician closed 5 years ago

benthestatistician commented 6 years ago

Perhaps relatedly, in [master b9dd3ab] I recorded a note (in matrixNotation.Rnw) about a possible lingering issue with how weights are or aren't being pulled into some SparseM related calculations (see para beginning "A computational issue...").

(Whether or not these are linked, both are on the topic of weights, and both need to be addressed.)

benthestatistician commented 6 years ago

(It's possible that the disagreements in private files referenced above trace not to weights processing but to inclusion of a constant column at the front of NotMissing slots. They were disagreements of the combined differences statistic, which in xBal's case was being computed from a covariates matrix with one fewer column than in the balanceTest case. In balanceTest's additional column, due to some numerical issues it wasn't quite a 0, and the SVD calculations might have been affected. Should check this before spending more time on the weights problem...)

benthestatistician commented 6 years ago

Commit [master bf3cfc6] deals w/ the constant column mentioned in the last comment on this thread, w/o reconciling the two calculations that don't match.

benthestatistician commented 6 years ago

Pretty sure I've figured out the xBalance()/balanceTest() disagreement.

It's a balanceTest() bug due to a some befuddlement that I wrote into RItools:::DesignWeights(). Briefly (and perhaps cryptically), under the default weights setting and without missingness in the covariate one should have "weight ratios" equal to 1, in order to align with displayed equation 5 of Hansen & Bowers (2008); as it stands now that piece of DesignWeights()'s output is instead proportional to the cluster mean of cluster weights within the stratum.

benthestatistician commented 6 years ago

Started a new branch, "issue89", for work on this...

benthestatistician commented 6 years ago

The matrixNotation vignette spells out some of the problems in more detail (as of [issue89 5bfcbab]) also laying out two alternative extensions of the definitions of HB2008 to address missingness. Just after this there's some older commentary referring to what might be justified under "missingness at random" assumptions. If one or the other of the two definitions makes sense under a straightforward assumption about missingness, we should go with it.

Then there will be a need for various adjustments to alignDesignsByStrata(), alignedToInferentials() and associated tests...

markmfredrickson commented 6 years ago

Tasks related:

benthestatistician commented 6 years ago

A more complete cleanup of balanceTest() internals might be possible if we remove post.alignment.transform as a function argument, instead aiming to support that sort of transform only in the special case of the "power user" who's willing to pull out and apply transformations to DesignMatrix objects (more specifically to their Covariates slots). For these users we could make balanceTest into a generic function, as @markmfredrickson and I have discussed offline, and have methods for objects of those types.

Without the need to handle post alignment transforms, I suspect that we could do without the AlignedCovs class, as well as the duplication of n-by-p matrices in the output of alignDesignsByStrata. Instead, the function now called alignedToInferentials could be reorganized just a little so that it doesn't presume that it's receiving covariates that are already aligned. It might receive covariates plus a stratifying variable plus stratum-wise info such as stratum weights, numbers of units assigned to treatment or to control, etc. With this info it does what's needed to calculate the same adjusted mean difference stat and covariances of such stats as it does presently, but without the presumption of the covariates being already aligned.

As it happens, the calculation of the combined diffs stats itself, ssn in the below snippet of alignedToInferentials definition, doesn't presume alignment:

ZtH <- S %*% n.inv %*% n1
ssn <- sparseToVec(t(matrix(zz, ncol = 1) - ZtH) %*% tmat, column = FALSE)

Alignment is presumed when the calculation of the corresponding covariance matrix gets going, on the next line:

scaled.tmat <- as.matrix(tmat * sqrt(dv))
tcov <- crossprod(scaled.tmat)
ssvar <- diag(tcov)    

But the alignment could be performed in the same step with which scaled.tmat is created from tmat, using sparse matrix ops. This should save some memory usage as well.

markmfredrickson commented 6 years ago

So I thought I had some tests that would show if unit weights were being properly handled in the case of non-missingness that used conditional logistic regression and xBalance. They match up without weights, but when I try to include weights, the answers don't line up with the current issue-89, nor when I walk the code backwards. I was under the impression that recent changes broke balanceTest, but that doesn't seem to be the case (at least from my tests).

  library(survival) # for conditional logistic regression 
  set.seed(20180207)
  n <- 100
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- 0.5 + 0.25 * x1 - 0.25 * x2 + rnorm(n)
  idx <- 0.25 + 0.1 * x1 + 0.2 * x2 - 0.5 * x3 + rnorm(n)
  y <- sample(rep(c(1,0), n/2), prob = exp(idx) / (1 + exp(idx)))

  xy <- data.frame(x1, x2, x3, idx, y)
  xy$m[y == 1] <- order(idx[y == 1])
  xy$m[y == 0] <- order(idx[y == 0])

  xb1 <- xBalance(y ~ x1 + x2 + x3, data = xy, strata = list(unmatched = NULL, matched = ~ m), report = c("all"))
  bt1 <- balanceTest(y ~ x1 + x2 + x3 + strata(m), data = xy, report = "all",)
  cr1 <- clogit(y ~ x1 + x2 + x3 + strata(m), data = xy)

  expect_equivalent(xb1$overall$chisquare, bt1$overall[2:1, "chisquare"])
  expect_equivalent(xb1$overall$p.value[2], summary(cr1)$sctest["pvalue"])
  expect_equivalent(bt1$overall[1, "p.value"], summary(cr1)$sctest["pvalue"])

These tests pass. The next tests do not.

  wts <- rpois(n, 7)
  wts <- wts / sum(wts)
  xy$wts <- wts # NB: balanceTest doesn't seem to like just using wts directly during tests, but is fine with it for interactive sessions
  xy.wts <- data.frame(x1 = xy$x1 * wts, x2 = xy$x2 * wts, x3 = xy$x3 * wts, idx = xy$idx, y = xy$y, m = xy$m)

  xb2 <- xBalance(y ~ x1 + x2 + x3, data = xy.wts, strata = list(unmatched = NULL, matched = ~ m), report = c("all"))
  bt2 <- balanceTest(y ~ x1 + x2 + x3 + strata(m), data = xy, unit.weights = wts, report = "all",)
  cr2 <- clogit(y ~ x1 + x2 + x3 + strata(m), data = xy.wts)

comparing the last three items yields

> Error: xb2$overall$chisquare not equivalent to bt2$overall[2:1, "chisquare"].
2/2 mismatches (average diff: 0.0943)
[1] 1.02 - 1.04 == -0.0201
[2] 1.49 - 1.33 ==  0.1685
> >   expect_equivalent(bt2$overall[1, "p.value"], summary(cr2)$sctest["pvalue"])
Error: bt2$overall[1, "p.value"] not equivalent to summary(cr2)$sctest["pvalue"].
1/1 mismatches
[1] 0.723 - 0.684 == 0.0394
benthestatistician commented 6 years ago
> expect_equivalent(xb2$overall$p.value[2], summary(cr2)$sctest["pvalue"])
> 
benthestatistician commented 6 years ago

The problem this issue addresses came about as a result of my attempting to be a little too clever about handling of missing Xs. A better starting point for addressing that problem is to tap into other packages that are designed to handle missing Xs. #94 is directed in part at that.

benthestatistician commented 6 years ago

Note to self: in Design.R, wtratio seems to correspond to $s_b / h_b$, in matrixNotation.Rnw notation. Here $s_b$ has been normalized to sum to 1 although $h_b$ has not. That in itself is as it should be, but the weight ratio should really be $\frac{s_b}{h_b \bar{w}_b}$. That makes a bit more sense in itself, as well as lining up better w/ eq:13 (presently display (7) ) in matrixNotation.Rnw.
But can this be achieved w/o big changes to code and/or API? I suspect yes.

To-dos:

benthestatistician commented 6 years ago

Reflections on why these tests of Mark's continue to fail:

  xb2 <- xBalance(y ~ x1 + x2 + x3, data = xy.wts, strata = list(unmatched = NULL, matched = ~ m), report = c("all"))
  bt2 <- balanceTest(y ~ x1 + x2 + x3 + strata(m), data = xy, unit.weights = wts, report = "all",)
  cr2 <- clogit(y ~ x1 + x2 + x3 + strata(m), data = xy.wts)
  expect_equivalent(xb2$overall$chisquare, bt2$overall[2:1, "chisquare"])
  expect_equivalent(xb2$overall$p.value[2], summary(cr2)$sctest["pvalue"])
  expect_equivalent(bt2$overall[1, "p.value"], summary(cr2)$sctest["pvalue"])

(They look a bit different in current code; look in test.balanceTest.R, near bottom of test "balanceTest agrees with other methods where appropriate".)

  1. xBal() interprets stratum weight ratios to mean $s_b / h_b$, in matrixNotation.Rnw notation, whereas for balT() the interpretation is $\frac{s_b}{h_b \bar{w}_b}$. Accordingly within these tests the x1, x2 and x3 cols of xy.wts should have $\bar{w}_b^{-1}$ factors multiplied through them.
  2. balT() should have an implicit 4th variable, weight/size, that's being factored into its combined differences stat; but not so for xBal() or clogit(), unless in those cases we put it there of course.
benthestatistician commented 6 years ago

I'm having a crisis of faith in this calculation, within RItools:::alignDesignsByStrata():

covars.Sctr[,jj] <- if (all(covars[,jj]==covars[1L,jj])) 0 else 
                slm.wfit.csr(  
                    S, covars[,jj],  
                    weights=ewts[, max(1L, covars.nmcols[jj]), drop = TRUE])$residuals

The calculation is there to stratum-center the covariates, or more specifically the cluster totals of covariates. An alternative would be to do something more like

covars.Sctr[,jj] <-  
  slm.fit.csr(
              S, covars[,jj] * weights=ewts[, max(1L, covars.nmcols[jj]), drop = TRUE]
              )$residuals

and at the same time make the change that the output of this function is expected to be not aligned covariate means, and separately unit weights, but rather aligned covariate totals, and no unit weights.

Shifting to covariate totals for inferential calculations would line up better with the setup of Hansen & Bowers 2008, I now think.

In addition, the current approach has the drawback that there's no clear path to getting z statistics for size differences between the groups. (I tried to do this by adding a column of constants, n=1, but that didn't work; it just reports NA in the z column, as a downstream result of the fact that from the weighted lm perspective those 1's for different clusters are all the same thing, irregardless of differences in cluster size.)

benthestatistician commented 5 years ago

Note to self: for now, branch for this work is "proj1-balT", not "issue89" (which has been merged into proj1-balT & deleted from GH).

benthestatistician commented 5 years ago

The "crisis of faith" noted above turns out to have been a personal issue, not an problem with the code. The concern expressed here was addressed in 2fff4a0 and preceding commits. See discussion in matrixNotation, Sec 2 ("Purposes and Premises: Inferential Statistics").

Reviewing the code & this issue exposed an unnecessary complication w/ the CovsAlignedToADesign class. I'll close this out again when I address that.

benthestatistician commented 5 years ago

For ease of reference here's the passage of matrixNotation explaining what's going on w/ stratum-wise alignment.

image