maximerischard / DeconvolutionTests.jl

Test for equality of two distributions given observations with known measurement errors.

Incorporate the deconvolution error #8

Open maximerischard opened 6 years ago

maximerischard commented 6 years ago

The first step of our algorithm is the deconvolution of all the data (obtaining an estimate of F_0). My intuition was that the uncertainty in this deconvolution is not crucial. But is this correct? Can we make this intuition more precise?

It would be possible to incorporate the uncertainty in the deconvolution by first bootstrapping the original data (nonparametric bootstrap). What do we gain by doing so? Is it more valid? more robust? more powerful?
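For concreteness, a minimal sketch of that bootstrap loop (here `deconvolve` is a hypothetical stand-in for whatever routine we end up using to estimate F_0 from the noisy observations):

```julia
using Random, StatsBase

# `deconvolve(x, sigma)` is a hypothetical stand-in for the routine that
# estimates F_0 from observations x with known measurement errors sigma.

function bootstrap_deconvolutions(x, sigma; B = 200, rng = Random.default_rng())
    n = length(x)
    map(1:B) do _
        idx = sample(rng, 1:n, n; replace = true)   # nonparametric bootstrap:
        deconvolve(x[idx], sigma[idx])              # re-deconvolve each resample
    end
end
```

Each replicate gives an alternative F_0 estimate, so the spread across replicates is a direct picture of how much the deconvolution uncertainty matters downstream.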

Are there other ways to handle this uncertainty?

maximerischard commented 6 years ago

The bootstrap would also be a way to do a kind of sensitivity analysis: wiggle the data around and check that the p-value doesn't change too much (i.e. check that the distribution of the test statistic under the null is stable under reasonable perturbations of the assumed null distribution). It could be something we do in an applied example in the paper, and something we recommend to people trying out new test statistics.
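As a sketch of that stability check, assuming a hypothetical `run_test(x1, sigma1, x2, sigma2)` that wraps the whole pipeline (deconvolve, then return the p-value):

```julia
using Random, Statistics, StatsBase

# `run_test` is a hypothetical wrapper around the full pipeline:
# deconvolve the pooled data, then return the p-value of the test.

function pvalue_stability(x1, s1, x2, s2; B = 100, rng = Random.default_rng())
    pvals = map(1:B) do _
        i1 = sample(rng, 1:length(x1), length(x1); replace = true)
        i2 = sample(rng, 1:length(x2), length(x2); replace = true)
        run_test(x1[i1], s1[i1], x2[i2], s2[i2])
    end
    quantile(pvals, [0.1, 0.5, 0.9])   # these should sit close together if the test is stable
end
```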

lfcampos commented 6 years ago

The Efron deconvolution comes with error bars, so if we're willing to make distributional assumptions about the underlying (deconvolved) distribution, we may be able to skip the bootstrap and do the deconvolution step only once, which would save a lot of computation (we wouldn't have to deconvolve every bootstrap sample).

I think assuming a parametric form for the underlying distribution is a heroic assumption in most applications, but in simple ones (like the example we were given) the computational gains might be worth it.

maximerischard commented 6 years ago

But isn't the underlying distribution modeled as this weird exponential family spline thingy? It seems quite flexible, and maybe not as strong an assumption as it first appears.
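For reference, my reading of the g-modeling setup in the Efron paper (hedged, I may be garbling details) is that the prior lives on a grid theta_1, ..., theta_m and is modeled as an exponential family built on a spline basis Q:

```latex
g_\alpha(\theta_j) = \exp\!\left( Q(\theta_j)^\top \alpha - \phi(\alpha) \right),
\qquad
\phi(\alpha) = \log \sum_{j=1}^{m} \exp\!\left( Q(\theta_j)^\top \alpha \right)
```

so the "degree" is just the number of columns of Q (natural spline basis functions), and only the low-dimensional alpha is estimated. That's flexible in shape but still a strong smoothness assumption.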

Using the error bars directly is an interesting possibility. Is there any sense of the correlation between two points on the estimated PDF/CDF?

lfcampos commented 6 years ago

That's an interesting point. I just tried simulating x_i from a bimodal distribution and used deconv directly, assuming a Normal g(theta), and I was surprised by the result (see below). It's not perfect (which worries me with N = 5K), but it captures the shape relatively well (assuming a degree 5 polynomial).

In terms of the correlation, the estimated g(theta) does come with a covariance matrix. In the image, the light grey lines correspond to draws from a multivariate Normal with the mean and covariance of the g(theta) estimate.

[image: download-1 — deconvolution estimate with multivariate Normal draws shown as light grey lines]
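In case it's useful, the idea behind the grey lines as a minimal Julia sketch (here `g_hat` and `cov_hat` are placeholders for the estimated g(theta) on its grid and the covariance matrix coming out of the deconvolution):

```julia
using Distributions, LinearAlgebra, Random

# `g_hat` (vector) and `cov_hat` (matrix) are placeholders for the estimated
# g(theta) on a grid and its covariance, as returned by the deconvolution.

function draw_g_curves(g_hat, cov_hat; ndraws = 50, rng = Random.default_rng())
    mvn = MvNormal(g_hat, Symmetric(cov_hat))   # Symmetric() guards against tiny numerical asymmetries
    rand(rng, mvn, ndraws)                      # each column is one plausible g(theta) curve
end
```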

lfcampos commented 6 years ago

Here it is, though, for varying polynomial degrees. This is encouraging. Assuming a 2nd degree polynomial amounts to assuming Normality; what do the other degrees correspond to, then? I think we need to read the paper that led to the Efron 2016 paper we're reading:

Efron, B. (2014). Two modelling strategies for empirical Bayes estimation. Statist. Sci. 29, 285–301.

[image: download-3 — deconvolution estimates for varying polynomial degrees]

maximerischard commented 6 years ago

Presumably the degree can be tuned? Through cross-validation or some sort of shortcut.
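Something like K-fold cross-validation over the degree would be easy to sketch, e.g. scoring the held-out noisy observations under the convolved fit (both `deconvolve(x, sigma; degree)` and `loglik(fit, x, sigma)` below are hypothetical placeholders):

```julia
using Random, Statistics

# `deconvolve(x, sigma; degree)` fits the prior with the given spline degree;
# `loglik(fit, x, sigma)` scores held-out *noisy* observations under the
# convolved fit. Both are hypothetical placeholders.

function choose_degree(x, sigma; degrees = 2:8, nfolds = 5, rng = Random.default_rng())
    n = length(x)
    perm = randperm(rng, n)
    folds = [perm[k:nfolds:n] for k in 1:nfolds]       # interleaved random folds
    scores = map(degrees) do d
        mean(folds) do test
            train = setdiff(perm, test)
            fit = deconvolve(x[train], sigma[train]; degree = d)
            loglik(fit, x[test], sigma[test])
        end
    end
    degrees[argmax(scores)]   # degree with the best held-out log-likelihood
end
```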

This does raise a potential weakness of the deconvolution+bootstrap approach: the null distribution we substitute for the true null will look more like what the deconvolution algorithm expects to see. I think this could lead to reduced power, though hopefully not invalidity. What I have in mind is that one could imagine two distributions that are genuinely different but converge to the same estimate under “decon degree: 5”, for example, so we would have no power to tell them apart no matter how much data is available.