markmfredrickson / RItools

Randomization inference tools for R
GNU General Public License v2.0
17 stars 11 forks source link

touch up signif tests in `xBalance` when invoked w/ a post-alignment transform #27

Closed benthestatistician closed 10 years ago

benthestatistician commented 10 years ago

Presently post alignment transforms affect standard error calculations, but not the calculation of the statistics for which the standard errors are being calculated. I think that the adjusted means, adjusted mean differences and standardized differences the function reports back to users should not be affected by the post-alignment transform, so for purposes of reporting this is correct.

On the other hand, for the z statistics, p-values and chisquare statistics, we need adjusted differences and corresponding standard errors neither of which have been calculated using alignment or both of which have been calculated after alignment. I think it should be both. A simple answer is to re-calculate ssn just after this point in xBalanceEngine:

 if (!is.null(post.align.trans)) {
    tmat.new <- apply(tmat, 2, post.align.trans)

I.e., on the next line add in

ssn <- drop(crossprod(zz-ZtH, tmat.new)) 

Before implementing, merits a second pair of eyes looking for undesirable side effects.

benthestatistician commented 10 years ago

Noting that once we've dealt with this issue, we'll be able to get Wilcoxon rank sum and signed rank tests by passing post.alignment.transform=rank. Perhaps this is low-lying fruit. Jake, do you have time to take a look?

benthestatistician commented 10 years ago

An omission of my original suggestion is that when a post alignment transform has been provided and the user requests both adj.mean.diffs and adj.mean.diffs.null.sd, the post alignment transformation occurs after the computation of adj.mean.diffs but before that of their null SDs. The two will be out of step.

There are two issues: whether to reconcile the discrepancy by applying the post alignment transformation before both, or neither; and how to code up the change. My feeling is that both should be happening before the transformation: we want the adjusted mean difference that we report to be as interpretable as possible, and the adj.mean.diffs.null.sd is largely a pedantic device, for clarifying what randomization inference is all about. Post alignment transforms are inside baseball, and the people who need to see the adj.mean.diff.null.sd don't want to be burdened with them. Plus, it's reasonably coherent to have those transforms apply only to the calculation of z statistics, chisquare statistics and the p-values attaching them.

A way to implement that change would be to fill out the code block running from if (!is.null(post.align.trans)) through assignment of ans[['adj.diff.null.sd']] something like the following.

dv <- unsplit(tapply(zz,ss,var), ##sample variance of treatment
                ss)
if ('adj.mean.diffs.null.sd' %in% report) {
    ssvar <- apply(dv*tmat*tmat, 2, sum) 
    ans[['adj.diff.null.sd']] <- sqrt(ssvar*(1/wtsum)^2)
  }
  if (!is.null(post.align.trans)) {
    tmat.new <- apply(tmat, 2, post.align.trans)

    # Ensure that post.align.trans wasn't something that changes the size of tmat (e.g. mean).
    # It would crash later anyway, but this is more informative
    if (is.null(dim(tmat.new)) || !all(dim(tmat) == dim(tmat.new))) {
      stop("Invalid post.alignment.transform given")
    }

    tmat <- tmat.new
    ssvar <- apply(dv*tmat*tmat, 2, sum) 
  }

It's a little bit complicated, but it avoids nested conditionals. ssvar may be calculated twice, but I don't see a way around this when the user has both given us a transform and requested null SDs. (At least not if you accept my view on how this situation ought to be handled.)

jwbowers commented 10 years ago

Good points. I'll work on it. For now, I'm working a bit on interpretation here is the rough equivalent of what our approach to post transformation does for a single variable (from my as yet not committed inst/tests/test.xBalance.R

res6 <- xBalance(pr ~ cost, data=nuclearplants, post.alignment.transform = rank, report="all")

  mm<-nuclearplants$cost
  zz<-nuclearplants$pr
  tmatRank<-rank(mm-mean(mm))
  zzMd<-zz-mean(zz)
  lm6<-lm(tmatRank~zzMd-1)
  summlm6<-summary(lm6)$coef

  expect_true( all(summlm6[,3:4]-res6$results[,c("z","p"),] < c(.1,.1)) )

If you play with this code, you'll notice that the centering step is crucial. We are not testing the difference in means of ranked outcomes here and are not coming close to p-values from wilcox.test or lm without this centering.

jwbowers commented 10 years ago

I've made a branch --- posttransform --- that contains the work I've done so far. I'll get back to this later.

jwbowers commented 10 years ago

I've implemented these changes in [posttransform 4a87151]. I also show in inst/tests/test.xBalance.R that

  res6 <- xBalance(pr ~ cost, data=nuclearplants, post.alignment.transform = rank, report="all")

Produces z and p very similar to

lm7<-lm(I(rank(cost))~I(pr-mean(pr))-1,data=nuclearplants)

but not either of these

lm8<-lm(I(rank(cost))~I(pr-mean(pr)),data=nuclearplants)
lm9<-lm(I(rank(cost))~pr,data=nuclearplants)

Next steps include documenting this and fixing the chi.square problem in issue #28 (the p-value for the chi-square on the univariate, unadjusted test should be the same as the p-value for the z-statistic).

markmfredrickson commented 10 years ago

Fixed in #29.