Numerous CS with very large log10 BF

aob93 commented 4 years ago

Hi there,

We've been running SuSiE to derive credible sets for many different loci, and the results so far have made a lot of sense. However, other loci are proving to be more difficult. In these cases, SuSiE takes a very long time to run (around 10 hours) and produces as many as 8 credible sets at the locus, with suspiciously high log10BFs.

For example, running the following command; SuSiE.fit <- susie_rss(zscores,ld,max_iter=100000,L=10)

returns;

summary(SuSiE.fit)$cs cs cs_log10bf cs_avg_r2 cs_min_r2 variable 1 1 149264.7 1 1 1008 2 2 45979.0 1 1 1031 3 3 15692025.6 1 1 1027 4 4 10134281.4 1 1 1019 5 5 2946177.3 1 1 1014 6 6 143381.9 1 1 993 7 7 1581890.1 1 1 1024 8 8 12347874.2 1 1 1028,1030

Upon closer inspection, these variants are all highly correlated with each other in 1000GP (R2 > 0.95) , and we would expect them to be found in the same CS.

This has made me think that it is the LD matrix which is problematic, however, the same script works perfectly at other loci.

To calculate the correlation matrix, we use the following code'

plink <- read.table(file="WHR.plink.raw",header=T) N <- dim(plink)[1] X <- as.matrix(plink[,-c(1:6)]) X <- scale(X, center = TRUE, scale = TRUE) ld <- t(X)%*%X/N

Where the .raw file is the result of using the recode -A function from plink. To my knowledge, this produces the correct type of matrix for the analysis and works perfectly at other loci.

Are there any obvious errors that I am making or any reasons from your own experience why credible sets like these would be outputted?

Many thanks.

gaow commented 4 years ago

@aob93 thanks for your message. 10 hours is very long time. Can you let us know how many variants are there in the loci in question, and how many iterations it took to converge, if ever? susie_get_niter(SuSiE.fit).

To clarify, you are using reference panel from 1000 GP on your own data-set? Or you have your own WHR genotypes to use that matches your summary statistics? Sounds like the first case but just wondering because your filenames seem to suggest otherwise.

As for using reference LD, could you try this vignette: https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#using-ld-from-reference-panel and use the z_ld_weight parameter in susie_rss?. You can read on the page the motivations for it and what it does

aob93 commented 4 years ago

Hi @gaow!

Thanks for your prompt response.

We are using individual-level genotype data from our GWAS to calculate the LD matrix. This works perfectly most of the time.

At this locus, I'm looking at 1298 variants, which is much smaller than some of the other loci I've ran successfully.

The susie_get_niter() function does not seem to be available on my installation of susieR. Is this part of a later version? This is my current installation;

packageVersion('susieR') [1] ‘0.9.0’

gaow commented 4 years ago

@aob93 What's the reason you dont work directly with individual level data? Is that because your sample size is too large to run with susie() function? How bad is that for 1298 variants for this case in question -- what happens if you run the full data version susie() for it?

The susie_get_niter() function does not seem to be available on my installation of susieR

Sorry I forgot the export that function. For now could you use susieR:::susie_get_niter()?

aob93 commented 4 years ago

Usually we calculate LD using genotypes from only a subset of our GWAS phases due to some differences in the genotyping panels. So we are not using all of the individual genotype data available to us. susie() would probably not be the best option in this case.

In this particular analysis, we are using results from another GWAS not carried out by ourselves. I'm using an LD matrix generated rom our own GWAS as the populations have similar ancestry.

Looking through the available functions in susieR:::, susie_get_niter() does not appear on the list.

gaow commented 4 years ago

Thanks @aob93 in this case your genotype and summary stats dont match exactly. It is worth trying the link I suggested above.

Sorry for the confusions on our interface. For now please just run SuSiE.fit$niter and let us know ?

aob93 commented 4 years ago

Thanks again @gaow, I'll give that link a go.

It appears that this model converges in 5113 iterations.

pcarbo commented 4 years ago

@aob93 That's a lot of iterations. You might find that the result is very similar if you run a smaller number of iterations, e.g., 100. (If the result is very different, we would be interested in knowing that.)

aob93 commented 4 years ago

Hi @pcarbo !

After setting max_iter to 100, the command errors out with;

IBSS algorithm did not converge in 100 iterations!

pcarbo commented 4 years ago

That is a warning only.

aob93 commented 4 years ago

OK, so after 100 iterations, we get the following credible sets;

cs cs_log10bf cs_avg_r2 cs_min_r2 variable 1 15585.49 1 1 1008 2 17330.58 1 1 993 3 60993.49 1 1 1027 4 78942.74 1 1 1019 5 26730.04 1 1 996 6 16199.28 1 1 1028,1030

It seems to be a similar pattern to the previous attempt with some SNPs left out.

zouyuxin commented 4 years ago

Hi @aob93, since your LD matrix is computed using genotypes that doesn't exactly match summary stats, it's recommended using z_ld_weight in susie_rss. The details are in https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#using-ld-from-reference-panel

aob93 commented 4 years ago

OK thank you all for your help, very much appreciated! this should be plenty to keep me going!

hopedisastro commented 1 month ago

I'm having a similar problem where I am getting numerous credible sets (8-10) for most of the loci I'm testing genome-wide. Running SuSiE throws no errors/warnings. I've run kriging_rss and the plot looks fine (no points highlighted in red). I am using an in-sample LD and running susie_rss()

Is there any way to assess the credibility of the credible sets further? Can I use the z_ld_weight option, or is this advice not compatible with the latest version of SuSiE?

I've run step-wise conditional analysis on a couple of these loci and the results suggest that there are far fewer than 8 independent signals.

pcarbo commented 1 month ago

It is hard to say, there could be many explanations (or perhaps these results are real!). My first thought is population structure—is this a multi-ethnic or multi-ancestry cohort, and if so, did you account for population structure when computing the summary statistics (e.g., z-scores)?

stephenslab / susieR

Numerous CS with very large log10 BF #105