stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

Behavior of prior_tol and tol thresholds #195

Open jerome-f opened 1 year ago

jerome-f commented 1 year ago

Hi,

I have noticed that at loci with non-sparse signal structure, both susie_suff_stat and susie_rss take a very long time to converge to the default ELBO tolerance of tol = 0.001. Relaxing this to tol = 0.01 achieved much faster convergence with almost identical results (some floating-point differences in the PIPs, betas, etc.).

Similarly, at loci with many causal signals, the default prior variance threshold of prior_tol = 1e-9 causes the algorithm to get stuck resolving the last single effect. Relaxing that to prior_tol = 1e-8 misses one credible set but runs much faster.

I haven't thoroughly investigated the effect of adjusting these tolerances but wanted to document it here. (Both of the above situations involved in-sample LD and often >5 signals at a locus.) The prior variance threshold, I feel, can be inferred from existing GWAS or from simulations with different causal-signal scenarios.

It would be great if a fail-safe were put in place to prevent IBSS from getting stuck before convergence (apart from refine=TRUE and max_iter).
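For concreteness, a minimal sketch of the relaxed settings described above; z, R, and n here are placeholders for your own z-scores, LD matrix, and sample size, not the actual data from this issue:

library(susieR)
# Relax the ELBO convergence tolerance (default 1e-3) and the prior variance
# threshold (default 1e-9), as discussed above; z, R, n are placeholder objects.
fit <- susie_rss(z = z, R = R, n = n, L = 10,
                 tol       = 1e-2,
                 prior_tol = 1e-8)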

pcarbo commented 1 year ago

@jerome-f Have you found that refine = TRUE helps (produces better fits)?

We have also found that sometimes estimate_prior_method = "EM" produces better results (although not necessarily faster convergence).
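For example, a minimal sketch combining these suggestions (z, R, and n are again placeholders, and L = 10 is just an illustrative value):

# Hypothetical call illustrating estimate_prior_method = "EM" together with refine = TRUE.
fit <- susie_rss(z = z, R = R, n = n, L = 10,
                 estimate_prior_method = "EM",
                 refine = TRUE)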

pcarbo commented 1 year ago

Also, if you have examples (code and data) of particularly poor results or slow convergence that you would be willing to share (here, or privately), we would be happy to investigate more closely.

jerome-f commented 1 year ago

@pcarbo Yes, refine=TRUE worked in most cases to achieve better fits, but in the scenarios described above it didn't actually help much. Unfortunately I am not at liberty to share the data due to consortium agreements, but I can envision a sweep search with simulated data (similar to hyperparameter tuning in an ML context). Setting estimate_prior_method = "EM" was necessary, or the estimate of the residual variance would become negative in some cases. For context, these are cis-QTL loci with >5 signals (with a maximum of 59 signals in one case).

pcarbo commented 1 year ago

Sounds like you are already doing what I would have done. These may be particularly challenging fine-mapping examples, more challenging than we have dealt with in the past. Thanks for sharing your observations with us.

jerome-f commented 1 year ago

Although I cannot share the data directly, I can point you to the case on PanUKBB.

gaow commented 10 months ago

The prior variance threshold, I feel, can be inferred from existing GWAS or from simulations with different causal-signal scenarios.

Take height for example, from the abstract of the link:

In out-of-sample estimation and prediction, the 12,111 SNPs (or all SNPs in the HapMap 3 panel2) account for 40% (45%) of phenotypic variance in populations of European ancestry

That is, 0.4 / 12111 = 3.302783e-05. So, maybe a tol of 5E-6 is reasonable, assuming no other disease is going to be more polygenic than height?
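In R, the back-of-the-envelope calculation (reading the suggested threshold as the prior variance threshold prior_tol, which is an assumption on my part) is:

h2 <- 0.40   # phenotypic variance explained by the 12,111 SNPs, per the quoted abstract
m  <- 12111  # number of SNPs
h2 / m       # ~3.3e-05 per-SNP variance; a prior_tol such as 5e-6 sits safely below this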

jerome-f commented 9 months ago

So, maybe a tol of 5E-6 is reasonable, assuming no other disease is going to be more polygenic than height?

This seems reasonable.

jerome-f commented 5 months ago

Hi @pcarbo and @gaow,

I am attaching an LD matrix and z-score file for a susie_rss fit that, on my HPC system, did not produce an output even after running for 3 days (Rfast installed, most recent git version of susieR). FINEMAP reported 20 independent credible sets. This is a cis-QTL locus. I have limited bandwidth to explore why the runtime is so long, but I thought I would pass it along. You can start the fit with the code below:

library(susieR)
z  <- readRDS("summarstat.rds")  # summary statistics (z-scores)
ld <- readRDS("ldmat.rds")       # LD matrix
fit <- susie_rss(z = z$Z, R = as.matrix(ld), n = 27425, L = 21,
                 estimate_residual_variance = TRUE, verbose = TRUE, refine = TRUE)

The file can be found at this link.

pcarbo commented 5 months ago

@jerome-f I believe that the refinement procedure is very slow for large L. Did you try with refine = FALSE?

jerome-f commented 5 months ago

@pcarbo I tried both refine=FALSE and estimate_prior_method = "EM", but the job still didn't produce an output after 3 days. The objective stops updating in the stdout a few hours after launching, but I see the job still consuming both memory and CPU; likely it is stuck in the purity filtering for the CSs? I haven't tried modifying the purity thresholds.

pcarbo commented 5 months ago

@jerome-f You may be right about the purity filtering. I haven't pinpointed the issue, but I found this workaround that seems to work:

library(susieR)
set.seed(1)
z <- readRDS("summarstat.rds")$Z
R <- readRDS("ldmat.rds")
# Setting min_abs_corr = NULL appears to skip the credible-set (and purity)
# calculation inside susie_rss; the CSs are then computed separately below
# with susie_get_cs, using no purity threshold (min_abs_corr = 0).
fit <- susie_rss(z = z, R = R, n = 27425, L = 21,
                 min_abs_corr = NULL,
                 estimate_residual_variance = TRUE,
                 verbose = TRUE, refine = FALSE)
out <- susie_get_cs(fit, Xcorr = R, min_abs_corr = 0,
                    check_symmetric = FALSE)
sapply(out$cs,length)
# L1  L2  L3  L4  L5  L6  L7  L8 L11 L12 L13 L15 L16 L18 L20 L21 L14 L10 L19 L17
#  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   5   4  19
# L9
#  6

jerome-f commented 5 months ago

Thanks, that's good to know, and I can confirm it works smoothly at my end as well. A quick question about setting min_abs_corr=NULL: does it ignore correlations altogether? I.e., can it potentially lead to a CS where two variants have r^2 > 0.5? Or, the other way around, is every variant with r^2 > 0 separated into an individual CS? (My intuition says it is the former.)

Another question: I see you set the random seed. I don't usually set it, but have you noticed any difference in the simulations if you vary the seed, i.e., how sensitive/stable is the objective function to the random seed? Sorry, this is something I could experiment with and find out myself, but I thought you'd probably know the answer.

pcarbo commented 5 months ago

min_abs_corr=NULL, does it ignore correlations altogether?

Looking at the code, it seems to have the effect of skipping the CS calculation altogether.

Have you noticed any difference in the simulations if you vary the seed, i.e., how sensitive/stable is the objective function to the random seed?

I think it is not needed, I just do it out of habit.

pcarbo commented 5 months ago

I think you can also skip the purity calculation by not providing Xcorr to susie_get_cs.
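For example, a small sketch reusing fit from the workaround above:

# Omitting X and Xcorr from susie_get_cs returns the credible sets without
# computing purity (fit is the susie_rss result from the earlier workaround).
out <- susie_get_cs(fit, coverage = 0.95)
sapply(out$cs, length)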