Closed gaow closed 5 years ago
Update: we believe this is because of mismatch of LD reference and the actual X'X that generated those GWAS summary statistics. Now @zouyuxin re-implemented z-score version BF computations in Wakefield's ABF notation and estimating a quantity involving both prior and residual variance (as opposed to estimating them separately). The procedure is being tried out by others using my updated fine-mapping pipeline, along with the old susie_bhat
method. We'll see what they'll get.
Since people are now using the package quite actively I think we need to do something to clarify between what we are willing to stand behind and what methods are currently in testing (so people should probably not be using, or at least at their own risk).
I am currently only comfortable standing behind the full data version and the
version that does the full data computations exactly (susie_ss
).
Even with susie_ss
I just personally had problems because I input XtX and Xty without
first centering X and y! Can we put in check that quit (or give warning) if XtX and Xty don't look
like they are centered? For example, I think it should be true that sum(XtX) and sum(Xty) should be zero (within numeric tolerance) if both X and y were centered. [For bhat, presumably sum(bhat) should be zero?]
For susie_z and susie_bhat, can we have them emit a warning that they are in development and they should proceed at own risk? We might also consider whether these are the right names....
One starting point would be to be precise about the input arguments in the documentation (and then carefully check the input arguments in the code); for example, for input X to susie
, it only says X : An n by p matrix of covariates
. Can p = 0 or 1? Can n be 0 or 1? Can X be sparse? Does X have to be numeric? Can it be integer-valued; complex? Can it have missing or non-finite values? What other constraints are there on X?
This is only an illustration, but I think gives you the idea.
I think if we want to do minimal work (in terms of typing) to get close to what we need, then the two lines I'll add is to throw warnings when susie_z
and susie_bhat
is called. Also maybe we do not provide access to susie_ss
, but rather, for input data we compute XtX
and Xty
and use susie_ss
internally for computational reasons, when we deem it appropriate to do based on dimension of input data?
@gaow @zouyuxin
let's talk about the names. As I understand it the main difference between
susie_bhat
and susie_z
is that susie_bhat
uses summary data to
approximate the sufficient statistics and feeds them into the susie_ss
algorithm.
That is, the approximation is at the level of the algorithm.
In contrast, susie_z
(which is more in progress) tries to approximate the model
for the summary data first, and then fit the approximate model. This is in the spirit
of the "regression with summary statistics" likelihood (RSS-likelihood) work.
There is nothing inherent about bhat and z in these two contrasting approaches -
in principle both could be applied to either bhat or z... Eg one can
analyse z scores by feeding them to susie_bhat
with se=1.
Based on this I would propose to rename susie_z
to susie_rss
.
I'm not sure about name for susie_bhat
. We could leave that alone for now and
just mention that it can be used for z scores too?
I do think we should advertise susie_ss
as "sufficient statistics" rather than "summary statistics".
As discussed, for starters, @zouyuxin will implement susie_ss
with "sufficient statistics" where sample size n
is optional. Then change susie_z
, which currently implements the RSS model, to susie_rss
, adding checks on whether input summary statistics and LD reference are in the same space, and add warning messages to explain that the feature is experimental. From there we can run some benchmarks to better understand the problelm. We will then decide on what to do with susie_bhat
.
actually can I suggest the following:
for susie_ss
, we implement as you suggest, but also implement checks for whether
X'X and X'Y are i) computed from centered data (eg they should both be mean 0)
and ii) consistent with one another; eg X'Y should lie in the space spanned by
the non-zero eigenvectors of X'X. [Actually ii) may be somewhat tricky if we want to do
it cheaply without doing an eigen-decomposition of X'X..? Any ideas?.]
If either check is failed we could throw a warning. Or, more aggressively, we could throw an error and
add a parameter skip_checks=FALSE
that must be set by the user to TRUE to override...
(If the check ii is indeed expensive this could have the advantage of skipping the check...)
for susie_bhat
we keep as convenient interface to call susie_ss
(mostly like we do now...?)
My idea is that if susie_bhat
is called with "approximate" data (eg out-of-sample R) the checks in susie_ss
will handle it adequately.
for susie_z
we rename to susie_rss
, add a warning that it is in development,
and do not advertise it in vignettes.
See this example. We have to verify again the ELBO computation and scaling in derivations, in light of what we have observed. This is not of the highest priority though.