Possible bugs in summary statistics version of SuSiE

gaow commented 5 years ago

See this example. We have to verify again the ELBO computation and scaling in derivations, in light of what we have observed. This is not of the highest priority though.

gaow commented 5 years ago

Update: we believe this is because of mismatch of LD reference and the actual X'X that generated those GWAS summary statistics. Now @zouyuxin re-implemented z-score version BF computations in Wakefield's ABF notation and estimating a quantity involving both prior and residual variance (as opposed to estimating them separately). The procedure is being tried out by others using my updated fine-mapping pipeline, along with the old susie_bhat method. We'll see what they'll get.

stephens999 commented 5 years ago

Since people are now using the package quite actively I think we need to do something to clarify between what we are willing to stand behind and what methods are currently in testing (so people should probably not be using, or at least at their own risk).

I am currently only comfortable standing behind the full data version and the version that does the full data computations exactly (susie_ss).

Even with susie_ss I just personally had problems because I input XtX and Xty without first centering X and y! Can we put in check that quit (or give warning) if XtX and Xty don't look like they are centered? For example, I think it should be true that sum(XtX) and sum(Xty) should be zero (within numeric tolerance) if both X and y were centered. [For bhat, presumably sum(bhat) should be zero?]

For susie_z and susie_bhat, can we have them emit a warning that they are in development and they should proceed at own risk? We might also consider whether these are the right names....

pcarbo commented 5 years ago

One starting point would be to be precise about the input arguments in the documentation (and then carefully check the input arguments in the code); for example, for input X to susie, it only says X : An n by p matrix of covariates. Can p = 0 or 1? Can n be 0 or 1? Can X be sparse? Does X have to be numeric? Can it be integer-valued; complex? Can it have missing or non-finite values? What other constraints are there on X?

This is only an illustration, but I think gives you the idea.

gaow commented 5 years ago

I think if we want to do minimal work (in terms of typing) to get close to what we need, then the two lines I'll add is to throw warnings when susie_z and susie_bhat is called. Also maybe we do not provide access to susie_ss, but rather, for input data we compute XtX and Xty and use susie_ss internally for computational reasons, when we deem it appropriate to do based on dimension of input data?

stephens999 commented 5 years ago

@gaow @zouyuxin

let's talk about the names. As I understand it the main difference between susie_bhat and susie_z is that susie_bhat uses summary data to approximate the sufficient statistics and feeds them into the susie_ss algorithm. That is, the approximation is at the level of the algorithm.

In contrast, susie_z (which is more in progress) tries to approximate the model for the summary data first, and then fit the approximate model. This is in the spirit of the "regression with summary statistics" likelihood (RSS-likelihood) work.

There is nothing inherent about bhat and z in these two contrasting approaches - in principle both could be applied to either bhat or z... Eg one can analyse z scores by feeding them to susie_bhat with se=1.

Based on this I would propose to rename susie_z to susie_rss. I'm not sure about name for susie_bhat. We could leave that alone for now and just mention that it can be used for z scores too?

I do think we should advertise susie_ss as "sufficient statistics" rather than "summary statistics".

gaow commented 5 years ago

As discussed, for starters, @zouyuxin will implement susie_ss with "sufficient statistics" where sample size n is optional. Then change susie_z, which currently implements the RSS model, to susie_rss, adding checks on whether input summary statistics and LD reference are in the same space, and add warning messages to explain that the feature is experimental. From there we can run some benchmarks to better understand the problelm. We will then decide on what to do with susie_bhat.

stephens999 commented 5 years ago

actually can I suggest the following:

for susie_ss, we implement as you suggest, but also implement checks for whether X'X and X'Y are i) computed from centered data (eg they should both be mean 0) and ii) consistent with one another; eg X'Y should lie in the space spanned by the non-zero eigenvectors of X'X. [Actually ii) may be somewhat tricky if we want to do it cheaply without doing an eigen-decomposition of X'X..? Any ideas?.] If either check is failed we could throw a warning. Or, more aggressively, we could throw an error and add a parameter skip_checks=FALSE that must be set by the user to TRUE to override... (If the check ii is indeed expensive this could have the advantage of skipping the check...)
for susie_bhat we keep as convenient interface to call susie_ss (mostly like we do now...?) My idea is that if susie_bhat is called with "approximate" data (eg out-of-sample R) the checks in susie_ss will handle it adequately.
for susie_z we rename to susie_rss, add a warning that it is in development, and do not advertise it in vignettes.

stephenslab / susieR

Possible bugs in summary statistics version of SuSiE #70