stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

Coverage clarification #164

Closed RyanLongchamps closed 2 years ago

RyanLongchamps commented 2 years ago

Hello,

I was hoping for some clarification on the calculation of PIP versus coverage with SuSiE when L > 1. My understanding is that coverage is essentially the probability that the causal variant is contained within the credible set. In single-variant fine-mapping this should just be the cumulative sum of PIPs of the included variants until a given coverage threshold is reached.

However, for multi-variant fine-mapping (or when L > 1 in the case of SuSiE), my understanding is that coverage is now the joint posterior probability that the credible set contains the causal variant, e.g. > 0.95. The PIP, on the other hand, is the marginal posterior probability that a single SNP is causal. As a result, the sum of all PIPs in a given credible set need not equal the coverage. Am I interpreting this correctly?

Thank you!

xinyixinyijiang commented 2 years ago

Same question here!

gaow commented 2 years ago

Coverage of a CS in SuSiE is always a statement about a single effect, regardless of L. Each CS is based on the single-effect PIPs, PIP_l, and is constructed for every single effect SuSiE aims to capture. The reported PIP, on the other hand, is aggregated across all single effects: PIP_j = \sum PIP_lj for SNP j. In many cases only one of the PIP_lj is non-zero and the rest are close to zero, and that is when the sum of PIPs in a single-effect CS reflects the coverage. However, there are also cases where more than one PIP_lj is non-zero, in which case the sum of PIPs will exceed the coverage by a bit. You see this especially when two single-effect CS overlap with each other -- is that what you observe?

gaow commented 2 years ago

I should add that you can find PIP_l as $alpha in the susieR::susie()/susie_rss() output

RyanLongchamps commented 2 years ago

Hi Gao,

Thanks for the clarification. Yes, I do tend to see the sum of the PIPs exceed the coverage when there is more than one overlapping CS. In these instances, generally only one CS remains, presumably due to purity filtering.

One thing I am curious about is to what extent you would expect the sum of the PIPs to exceed the coverage in these instances. You said "a bit," and indeed oftentimes when I observe this the sum of the PIP is barely > 1; however I have observed a few instances where the sum of the PIP is as high as 1.2.

gaow commented 2 years ago

however I have observed a few instances where the sum of the PIP is as high as 1.2.

Can I ask what your concern is in this context? It depends on how much of an "overlap" your single effects have. In my experience (more with molecular QTL and limited with GWAS data, which I assume is your context?), I only occasionally see overlapping CS, and when that happens we may try some alternative initializations or use the refine parameter in susie() to see if that helps. But the concern there would be the algorithm being trapped in a local optimum, not so much that a CS has sum of PIP > 1.

xinyixinyijiang commented 2 years ago

Ryan - Thank you for opening this issue!

Hi Gao,

For my case, I used z scores and an LD matrix (R) from the same sample (n > 70,000). I got 167 CS from 113 loci, with 39 CS having a sum of reported PIP > 1. The five highest sums of PIP are 2.048540, 1.464357, 1.294889, 1.257850, and 1.192905.

I calculated the mean of alpha (for each L) for the most extreme CS case (i.e. sum of pip equals 2.048540). Here are the values: [1] 1.486848e-10 3.571854e-03 5.387378e-04 5.191457e-04 5.052033e-04 5.008242e-04 5.048004e-04 5.144049e-04 5.272693e-04 [10] 5.414172e-04.

It looks like alpha is not close to 0 when L >= 3. This is a pretty big CS with 266 SNPs, and I can see that alpha does not differ much between SNPs. I can also see other overlapping CS being filtered out because of the 'purity' threshold.

So do you also recommend using "alternative initializations or refine parameter in susie()" in this case? Will lowering L be feasible, let's say from 10 to 5?

Thank you!

gaow commented 2 years ago

I calculated the mean of alpha (for each L) for the most extreme CS case (i.e. sum of pip equals 2.048540).

This sounds like a useful case to check -- is there an overlap of CS here? I am not sure about the mean of alpha ... I doubt it is a useful quantity. What you want to look out for are those SNPs with non-zero PIP_l in more than one effect; in other words, those with PIP_j >> max(PIP_lj), particularly when SNP j is in multiple CS. We can then try to refine the analysis to see if that helps with this possible convergence issue.

Will lowering L be feasible, let's say from 10 to 5?

Did you use estimate_prior_variance = TRUE? If so, usually L should not matter. But things can potentially be more complicated when you use summary statistics. Checking overlaps in CS is what I would do first.

RyanLongchamps commented 2 years ago

Hi Gao,

Yes, you are correct - my context is GWAS.

I guess my concern is PIPs being affected by non-viable credible sets. As you mention, we often see sum of PIP > coverage when there are overlapping credible sets. However, in these instances only one of the credible sets tends to survive the purity-threshold filtering. As such, I am worried that I am seeing inflated PIPs for SNPs in overlapping credible sets, when only one of those credible sets is worth looking at in the first place, and thus the PIPs should not be affected by these low-purity credible sets.

I think in most cases this isn't much of an issue, as the sums of PIP barely exceed the coverage; however, a few cases such as the one posted above give me pause. Maybe I am not thinking about this correctly, but as always I just want to ensure my output data is being properly interpreted.

Thank you!

stephens999 commented 2 years ago

If you are primarily interested in fine-mapping reliable signals, I would focus on the reported CSs and, for each SNP in a CS, its alpha value for that CS (PIP_l in Gao's notation, which is essentially the SNP's "PIP" within that CS). These should not be inflated by overlap of CSs, and the sum of the alpha values for the SNPs in a CS gives (an estimate of) the coverage of that CS.

The PIPs may or may not be useful for comparison with other methods and for other analyses, but they seem in this case just to be distracting.
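The sum-of-alpha coverage estimate can be illustrated with a minimal sketch. The alpha values below are hypothetical, and Python is used purely for illustration; this is not susieR code, and susieR's actual CS construction additionally applies a purity filter.

```python
# Hypothetical single-effect PIPs (alpha) for one effect l over 5 SNPs.
alpha_l = [0.60, 0.25, 0.12, 0.02, 0.01]

# A level-0.95 CS for effect l: take SNPs in decreasing alpha_l until
# the cumulative sum reaches 0.95.
order = sorted(range(len(alpha_l)), key=lambda j: -alpha_l[j])
cs, total = [], 0.0
for j in order:
    cs.append(j)
    total += alpha_l[j]
    if total >= 0.95:
        break

print(cs)     # SNP indices in the credible set: [0, 1, 2]
print(total)  # sum of alpha over the CS = estimated coverage (here 0.97)
```

Summing alpha only within one effect is what keeps this estimate immune to CS overlap: alpha values from other effects never enter the sum.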

stephens999 commented 2 years ago

PIP_j = \sum PIP_lj for SNP j

is this right? I thought it was PIP_j = 1 - \prod_l (1 - PIP_{lj})
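The difference between the two aggregation rules can be checked numerically. The single-effect PIPs below are hypothetical values for one SNP across L = 3 effects, and Python is used purely for illustration; this is not susieR's implementation.

```python
# Hypothetical single-effect PIPs for one SNP j across L = 3 effects.
alpha_j = [0.9, 0.3, 0.0]

# Sum aggregation: can exceed 1 when several effects assign the SNP mass.
pip_sum = sum(alpha_j)  # 0.9 + 0.3 + 0.0 = 1.2

# Product aggregation: probability the SNP is causal in at least one
# effect, treating effects as independent; always stays in [0, 1].
pip_prod = 1.0
for a in alpha_j:
    pip_prod *= (1.0 - a)
pip_prod = 1.0 - pip_prod  # 1 - (1-0.9)(1-0.3)(1-0.0) = 0.93

print(pip_sum, pip_prod)
```

Note the sum form reproduces the 1.2-style values discussed earlier in the thread, while the product form is bounded by 1.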

RyanLongchamps commented 2 years ago

the sum of the alpha values for the SNPs in a CS will give (an estimate of) the coverage of that CS

Great! I appreciate the clarification that the sum is merely an estimate and may not always be equivalent to the coverage

Thank you!

gaow commented 2 years ago

is this right? I thought it was PIPj = 1- \prod(1-PIP{lj})

Oh, of course it's PIP_j = 1 - \prod_l (1 - PIP_{lj}), as we implemented in the package. Sorry, I wasn't thinking when I typed that ...

However, in these instances only one of the credible sets tends to survive the purity threshold filtering

Sounds like it's not an overlapping-CS situation, which would be harder to interpret; I'd worry about (and be curious about) a possible convergence issue. In your case, as @stephens999 said, I would focus on the single effects SuSiE reliably captures and resort to the definition of a CS.

@stephens999 suppose we do have overlapping CS: even though each CS is by definition correct, what would you conclude for that region? I would hesitate to claim that we captured 2 independent signals, right?

I appreciate the clarification that the sum is merely an estimate and may not always be equivalent to the coverage

Yes, coverage is defined in terms of PIP_{lj}, not PIP_j. Sorry if that wasn't clear before.