stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

Set log10bf to max if inf #201

Open jerome-f opened 11 months ago

jerome-f commented 11 months ago

Hi Peter,

I noticed that cs_log10bf can sometimes contain Inf values. It would be good to reset these to the data type's maximum.
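A minimal base-R sketch of this workaround (cap_inf is a hypothetical helper, not part of susieR):

```r
# Hypothetical helper (not part of susieR): replace +Inf log10 Bayes
# factors with the largest representable double.
cap_inf <- function(x) {
  x[is.infinite(x) & x > 0] <- .Machine$double.xmax
  x
}

cap_inf(c(3.2, Inf, 214.1))  # finite values unchanged; Inf becomes ~1.8e308
```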

Best Jerome

pcarbo commented 11 months ago

Thanks for the suggestion @jerome-f. The logBFs should always be finite. Do you have an example that produces an infinite logBF?

jmmax commented 8 months ago

I am having a similar problem to this. I have summary stats from a meta-analysis and 1000G EUR as the LD reference. I have used bigsnpr::snp_match to ensure no mismatched alleles. I originally was getting an error about non-convergence after 100 iterations. I then removed any snps that had low sample size (snp N is quite variable) and this stopped the convergence error. However, I now get cs_log10bf = Inf:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1        Inf         1         1      645
  2        Inf         1         1      674
  3        Inf         1         1      675
  4        Inf         1         1      671

Here are the SuSiE plots:

image

image

And the diagnostic plot (lambda is 0.52):

image

Any advice on this would be much appreciated. Is it likely that the reference panel is too small, or the meta-analysis too heterogeneous, for SuSiE to work in this case?

pcarbo commented 8 months ago

@jmmax If there is any way you could share your data + code, it would help us to reproduce the problem of infinite BFs, and pinpoint the bug.

Regarding your specific fine-mapping results, removing the SNPs furthest away from the diagonal of the diagnostic plot might improve your results.
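One way to do this, sketched under the assumption that the diagnostic was computed with kriging_rss and that its output contains a conditional_dist data frame with a z_std_diff column, as in the susieR diagnostic vignette (the threshold of 3 is an arbitrary illustration):

```r
# Sketch: given the output of susieR::kriging_rss (assumed to contain a
# conditional_dist data frame with a z_std_diff column, as in the susieR
# diagnostic vignette), drop the SNPs whose observed z-score deviates most
# from its expected value given the other z-scores and the LD matrix.
filter_by_kriging <- function(z, R, condz, threshold = 3) {
  keep <- abs(condz$conditional_dist$z_std_diff) < threshold
  list(z = z[keep], R = R[keep, keep, drop = FALSE], keep = keep)
}
```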

oborisov commented 2 months ago

Hi Peter,

I would ask here since my question is related to the infinite log10bf.

I am running susieR in a region with several very significant variants (p < 1e-5000). As a result, I am obtaining 10 credible sets, with an infinite cs_log10bf in 9 out of 10 sets (and each set consisting of a single variable). I used a completely matched in-sample LD matrix (same individuals, same variants). The genotypes are well imputed and there are no missing values. I checked the sanity of my input data according to the SuSiE recommendations; the plot is attached (it looks good, I assume?).

I ran susie_rss with verbose=TRUE. I saw that the "objective" in the output usually "converges" at some value quickly, e.g.:

[1] "objective:-56012.4505928347"
[1] "objective:-56009.8247200001"
[1] "objective:-56009.8188294932"
[1] "objective:-56009.818824447"

In my analysis, it was:

[1] "objective:-48181.7247065402"
...
[1] "objective:-17785.0571885817"
[1] "objective:-17564.5765219636"
[1] "objective:-17345.7019319673"
IBSS algorithm did not converge in 100 iterations!

I did some follow-up tests: I multiplied the standard errors of the GWAS effect sizes by 10 and re-calculated the z-scores. In this case, susieR quickly converged and produced a reasonable summary (1 credible set with cs_log10bf=50 and 30 variables). When trying "SE*3", I got 2 credible sets: one had an infinite cs_log10bf but still 11 variables in it. With "SE*2", I got four credible sets, two of them with a single variable and infinite cs_log10bf.
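For reference, this scaling experiment can be sketched as follows (hypothetical names: betahat and sebetahat stand for the GWAS effect sizes and standard errors):

```r
# Hypothetical sketch of the SE-inflation test described above: inflate the
# standard errors by a factor k before forming the z-scores that would be
# passed to susie_rss.
scale_z <- function(betahat, sebetahat, k) betahat / (k * sebetahat)
# e.g. fit_k <- susie_rss(z = scale_z(betahat, sebetahat, 10), n = n, R = R)
```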

Could it be that the infinite "cs_log10bf" is related to the very strong significance (i.e., very high z-scores)? If so, what could be a solution to make susieR produce finite "cs_log10bf" (tweak some options)? If not, maybe there are other things I could try?

Thanks, Oleg

stephens999 commented 2 months ago

If you are using in-sample LD then I would expect the observed and expected z-scores to match exactly, not just approximately. (Am I correct, @pcarbo?)

Matthew


pcarbo commented 2 months ago

Yes, if everything is the same. See here for example. In any case the kriging plot looks pretty good (if not exactly on the diagonal) — I don't see major cause for concern.

Since the associations are very significant, it is possible that the effects are quite large, and perhaps fixing the prior variance (see the "prior_variance" parameter) would improve the behaviour of IBSS. It is worth a try, IMO.
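A sketch of that suggestion (parameter names as in susie_rss; the value 200 is an arbitrary illustration, and z_scores, n and Rin are as defined elsewhere in this thread):

```r
# Sketch: fix the prior variance rather than estimating it, which may
# stabilize IBSS when the true effects are very large. The value 200 is
# an arbitrary illustration; z_scores, n and Rin are as defined earlier
# in this thread.
fitted_rss <- susie_rss(z = z_scores, n = n, R = Rin,
                        prior_variance = 200,
                        estimate_prior_variance = FALSE)
```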

stephens999 commented 2 months ago

i was concerned that if z scores are very large then even small deviations in the kriging plot could maybe be a problem.


pcarbo commented 2 months ago

Yes, that is possible. I don't think we have checked this.

oborisov commented 2 months ago

Thanks a lot, Matthew and Peter!

Will keep you posted!

Cheers, Oleg

oborisov commented 2 months ago

Hi again,

Here are the results of a few tests I did: 1) I took 4 SNPs with very significant p-values (<1e-5000), extracted the individual-level data (approx. N=40000) and the respective phenotypes. I ensured that the genotype and phenotype matrices are matched (as in the test example for N3finemapping$X and N3finemapping$Y). Then I used the following code:

sumstats <- univariate_regression(raw_gt_mat, pheno_mat) # where raw_gt_mat is formatted as N3finemapping$X and pheno_mat is formatted as N3finemapping$Y
z_scores <- sumstats$betahat / sumstats$sebetahat
Rin = cor(raw_gt_mat)
attr(Rin, "eigen") = eigen(Rin, symmetric = TRUE)
condz_in = kriging_rss(z_scores, Rin, n=n)
condz_in$plot

I am attaching the resulting kriging plot.

Could you think of any reasons why the z-scores are still not on the identity line?

2) I tried to tweak the "prior_variance" parameter but it didn't help with the convergence.

Thanks for your help!

Best, Oleg

pcarbo commented 2 months ago

Hi @oborisov thanks for sharing. Because your z-scores are so large, this seems very much like an "ill-conditioned problem" where small numerical errors can lead to large differences. (There could be a simpler explanation that hasn't occurred to me.)

In any case this is a situation we haven't studied carefully, so this is interesting. If you could (somehow) share a small data set reproducing these errors/warnings, I'd be happy to take a look and perhaps I can pinpoint the cause.

oborisov commented 2 months ago

Thanks Peter!

I simulated some data that I hope can help to reproduce the case above.

I used the following shell code ("plink" corresponds to PLINK 1.9 (https://www.cog-genomics.org/plink/1.9/) and "plink2" to PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/)):

echo "1 qtl 0.05 0.95  0.25 0" > wgas.sim
plink --simulate-qt wgas.sim --make-bed --out sim1 --simulate-n 100000
plink2 --bfile sim1 --export A --glm allow-no-covars --out sim2
cat sim2.PHENO1.glm.linear | column -t

It simulated 1 variant in 100,000 individuals with a very low p-value (p=4.55636e-6247). The simulated matrix of genotypes and phenotypes is attached (sim2.raw.gz).

Then I used the following R code


### R code ###
library(susieR)
# function to create SNPs in strong LD
add_snp <- function(x, n_replace=1000) {
  ind_to_change <- sample(length(x), n_replace)
  x[ind_to_change] <- sample(x[ind_to_change])
  x
}
# read genotype+phenotype data
raw_gt <- read.table("sim2.raw.gz", header=T)
# add 10 more variants with strong LD
variant_name <- grep("qtl", colnames(raw_gt), value=T)
for (i in 1:10) {
  set.seed(i+1234)
  raw_gt[[paste0("qtl_", i)]] <- add_snp(raw_gt[[variant_name]], n_replace=1000)
}
# remove the first variant. It can also be kept, but then only 1 credible set is identified. If removed, there are 2-3 CSs that better illustrate the case above.
raw_gt[[variant_name]] <- NULL
# create phenotype matrix (one variable)
pheno_mat <- as.matrix(raw_gt[,c("PHENOTYPE")])
# create genotype matrix
raw_gt_mat <- raw_gt[,7:ncol(raw_gt)]
n <- nrow(raw_gt_mat)
# calculate z-scores
sumstats <- univariate_regression(raw_gt_mat, pheno_mat)
z_scores <- sumstats$betahat / sumstats$sebetahat
# correlation matrix
Rin = cor(raw_gt_mat)
attr(Rin, "eigen") = eigen(Rin, symmetric = TRUE)

Correlation matrix (first 4 variables):

          qtl_1     qtl_2     qtl_3     qtl_4
qtl_1 1.0000000 0.9807084 0.9807084 0.9814978
qtl_2 0.9807084 1.0000000 0.9797956 0.9805111
qtl_3 0.9807084 0.9797956 1.0000000 0.9803630
qtl_4 0.9814978 0.9805111 0.9803630 1.0000000

# kriging plot
condz_in = kriging_rss(z_scores, Rin, n=n)
condz_in$plot
dev.off()

The plot is attached.

Finally, I ran fine-mapping

# susie_rss
fitted_rss <- susie_rss(z = z_scores,
                        n = n,
                        R = Rin)
summary(fitted_rss)

Warning message: In susie_suff_stat(XtX = XtX, Xty = Xty, n = n, yty = (n - 1) * : IBSS algorithm did not converge in 100 iterations! Please check consistency between summary statistics and LD matrix. See https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html

Variables in credible sets:

 variable variable_prob cs
        4     0.9999999  1
        8     0.9999955  2
        3     0.5724716  3
        1     0.3996082  3

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1        Inf  1.000000  1.000000        4
  2        Inf  1.000000  1.000000        8
  3   214.1423  0.961789  0.961789      1,3

Thanks again, Oleg

pcarbo commented 2 months ago

Wow @oborisov thanks for sharing this. I will try to reproduce this issue soon once I have access to my computer again.

oborisov commented 2 months ago

Thanks Peter!

By the way, in the toy example above I additionally tried the susie() method with the simulated individual-level data, which also gave me cs_log10bf=Inf. Using the real data, susie seems to converge much faster than susie_rss and gives different (more "reasonable"?) output: e.g., susie_rss gives 10 CSs, each with one variable and all with cs_log10bf=Inf, while susie gives six CSs with different numbers of variables (from 1 to 15), and only two have cs_log10bf=Inf while the other four have finite values. (Maybe it would be easier to reproduce the behavior with susie rather than susie_rss.)

Thanks again, Oleg

# susie
raw_gt_mat <- sapply(raw_gt_mat, as.numeric)
fitted <- susie(raw_gt_mat, as.matrix(pheno_mat))
summary(fitted)

Variables in credible sets:

 variable variable_prob cs
        4     1.0000000  1
        8     0.9999963  2
        1     0.5906970  4
        3     0.4084430  4

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1        Inf  1.000000  1.000000        4
  2        Inf  1.000000  1.000000        8
  4        Inf  0.961789  0.961789      1,3

pcarbo commented 2 months ago

That's interesting, thank you for sharing. (It isn't unexpected that the log-Bayes factor is Inf.)

oborisov commented 2 months ago

Thanks for the reply! I initially thought that cs_log10bf should always be finite. https://github.com/stephenslab/susieR/issues/201#issuecomment-1757740387 So even if we get cs_log10bf=Inf, we can still consider the identified CSs as reliable, right? (Given that the input is correct, i.e., individual-level data was used, so there can be no inconsistency between the LD matrix and sumstats.) Actually, in my analyses, I tried to use these CSs (with Inf cs_log10bf) in downstream colocalization (coloc.susie), which produced reasonable and expected results.

jerome-f commented 2 months ago

@oborisov coloc.susie uses the lbf to actually compare two credible sets. So with lbf being Inf, how did that work out? Did you change Inf to some finite value before running coloc.susie? Check the coloc.bf_bf function.

oborisov commented 2 months ago

Thanks @jerome-f for your reply. I was observing Inf values when applying summary method to the fitted object (output of susie()), as in the code in my message above. I now looked into the source code of summary.susie() function. These "Inf" values are likely produced in line 35, namely by log10(exp(object$lbf[cs$cs[i]])) https://github.com/stephenslab/susieR/blob/master/R/summary.susie.R#L35C12-L35C22 I looked into the actual lbf values and they were finite (fitted$lbf).

Next, if I see correctly, coloc.bf_bf takes lbf per variable, i.e., object$lbf_variable https://github.com/chr1swallace/coloc/blob/main/R/susie.R#L91-L95 I checked fitted$lbf_variable matrix and there were no Inf values. Therefore, coloc.susie should work fine I assume.
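To illustrate the overflow: exp() overflows to Inf for natural-log Bayes factors above about 709, whereas a direct change of base stays finite. A minimal base-R demonstration (the value 1500 is just an example):

```r
lbf <- 1500         # a large but finite natural-log Bayes factor
log10(exp(lbf))     # exp(1500) overflows to Inf, so this is Inf
lbf / log(10)       # numerically stable change of base: about 651.4
```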

jerome-f commented 2 months ago

Ah! That makes sense. I got confused with lbf of CS vs lbf of variable. My bad.

pcarbo commented 2 months ago

@oborisov Apologies for the delay — I believe I was able to reproduce your results in your toy example. This was the script I ran (with a few small changes to your code):

library(susieR)

# function to create SNPs in strong LD
add_snp <- function(x, n_replace = 1000) {
  ind_to_change    <- sample(length(x), n_replace)
  x[ind_to_change] <- sample(x[ind_to_change])
  return(x)
}

# read genotype+phenotype data.
raw_gt <- read.table("sim2.raw.gz", header = TRUE)

# Add 10 more variants with strong LD.
variant_name <- grep("qtl", colnames(raw_gt),value = TRUE)
for (i in 1:10) {
  set.seed(i + 1234)
  raw_gt[[paste0("qtl_",i)]] <- add_snp(raw_gt[[variant_name]],
                                        n_replace = 1000)
}

# Remove the first variant. It can also be kept, but then only 1
# credible set is identified. If removed, there are 2-3 CSs that
# better illustrate the case above.
raw_gt[[variant_name]] <- NULL

# create phenotype matrix (one variable)
pheno_mat <- raw_gt[,"PHENOTYPE"]

# create genotype matrix
n <- nrow(raw_gt)
p <- ncol(raw_gt)
raw_gt_mat <- as.matrix(raw_gt[,7:p])
storage.mode(raw_gt_mat) <- "numeric"

# Run susie.
fitted <- susie(raw_gt_mat,pheno_mat,verbose = TRUE)
all(is.finite(fitted$lbf_variable))
summary(fitted)

It turns out that the log10 Bayes factors in the summary function are computed in a very silly way, and with the fix (see the latest version on GitHub) the log10BFs are now finite:

> summary(fitted)
Variables in credible sets:
 variable variable_prob cs
        4     1.0000000  1
        8     0.9999963  2
        1     0.5906970  4
        3     0.4084430  4
Credible sets summary:
 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1  3098.6552  1.000000  1.000000        4
  2   779.3524  1.000000  1.000000        8
  4   661.2822  0.961789  0.961789      1,3

I also noticed that the IBSS algorithm was indeed making very slow progress on this data set, and I suspect that IBSS has difficulty due to the very strong LD and/or the very strong effects with PIPs very close to 1. In other words, this sort of slow convergence is expected here.

So at least this one problem has been solved (but it doesn't sound like this is the main problem you are struggling with).

pcarbo commented 2 months ago

Next running susie_rss,

# Calculate z-scores.
sumstats <- univariate_regression(raw_gt_mat, pheno_mat)
z_scores <- sumstats$betahat / sumstats$sebetahat

# LD matrix.
Rin <- cor(raw_gt_mat)
attr(Rin,"eigen") <- eigen(Rin,symmetric = TRUE)

# Run susie_rss.
fitted_rss <- susie_rss(z = z_scores,n = n,R = Rin,verbose = TRUE)
summary(fitted_rss)

I get this:

> summary(fitted_rss)
Variables in credible sets:
 variable variable_prob cs
        4     1.0000000  1
        8     0.9999963  2
        1     0.5906970  4
        3     0.4084430  4
Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1  3098.6552  1.000000  1.000000        4
  2   779.3524  1.000000  1.000000        8
  4   661.2822  0.961789  0.961789      1,3

The susie and susie_rss results look quite consistent here (except for the BFs).

In this case I suspect that the slow convergence is due to the LD structure, and not due to inconsistency between the LD and z-scores. (There are several reasons why IBSS can be slow to converge, and inconsistency among the summary statistics is only one of them.)

pcarbo commented 2 months ago

This thread is quite long and I lost track of the conversation — if there are other important issues that remain unresolved, please remind me!

oborisov commented 2 months ago

Thanks Peter!

My question above is now resolved.

To follow up on things that might be useful for the package: 1) I tried my code above with the updated susieR version. I am getting exactly the same results as you when running susie(). When running susie_rss(), I am getting different numeric output but with the same variables selected (I guess that's what you meant by "except for the BFs"). I guess these are numerical differences and the results are essentially the same.

Also, susie_rss() was faster than susie().

Variables in credible sets:

 variable variable_prob cs
        4     0.9999999  1
        8     0.9999955  2
        3     0.5724716  3
        1     0.3996082  3

Credible sets summary:

 cs cs_log10bf cs_avg_r2 cs_min_r2 variable
  1   813.6982  1.000000  1.000000        4
  2   948.8904  1.000000  1.000000        8
  3   214.1423  0.961789  0.961789      1,3
2) I am still unsure why the kriging_rss plot (above) did not show identical expected and observed z-scores.

Both points are not critical to me anymore; I am just mentioning them out of curiosity.

Thanks again for your work and best wishes, Oleg

pcarbo commented 1 month ago

Thanks @oborisov for the feedback.

susie_rss is expected to be faster than susie when n, the number of individuals in the data set, is larger than p, the number of SNPs.

As for the differences in the Bayes factors between susie and susie_rss, note that susie_rss applied to the z-scores implicitly makes certain assumptions about the analysis, and these assumptions are not necessarily made by susie (e.g., that X and y are standardized); see our PLoS Genetics paper for the details. By contrast, if you run susie_rss with bhat and shat instead of the z-scores, the results should be exactly the same (up to numerical differences).
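A sketch of that bhat/shat interface (argument names as in susie_rss; sumstats, n, Rin and pheno_mat are as defined in the earlier code in this thread):

```r
# Sketch: call susie_rss on the effect estimates and standard errors
# rather than the z-scores; sumstats, n, Rin and pheno_mat are as defined
# earlier in this thread.
fitted_rss2 <- susie_rss(bhat = sumstats$betahat,
                         shat = sumstats$sebetahat,
                         n = n, R = Rin,
                         var_y = var(as.numeric(pheno_mat)))
```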

I haven't looked at this very carefully, but overall my sense is that the differences in the numbers between the X and Y axes in your latest kriging plot are small relatively speaking. It is also possible that the tests do not work as well for larger z-scores. We haven't studied the behaviour of these tests very carefully in this setting, so I can't say for sure.

pcarbo commented 1 month ago

Also please see this vignette which illustrates the different variants of susie_rss.

stephens999 commented 1 month ago

I'm not sure I would expect IBSS to be slow just because of strong LD and/or PIPs close to 1. So maybe it is worth looking at this further (I hope I might get time to do this eventually!)


pcarbo commented 1 month ago

@stephens999 This is odd:

> fitted <- susie(raw_gt_mat,pheno_mat,L = 4,verbose = TRUE)
> round(fitted$alpha,digits=3)
     qtl_1 qtl_2 qtl_3 qtl_4 qtl_5 qtl_6 qtl_7 qtl_8 qtl_9 qtl_10
[1,] 0.000     0 0.000     1     0     0     0     0 0.000      0
[2,] 0.000     0 0.000     0     0     0     0     1 0.000      0
[3,] 0.000     0 0.000     1     0     0     0     0 0.000      0
[4,] 0.591     0 0.408     0     0     0     0     0 0.001      0
> round(fitted$mu,digits=3)
      qtl_1  qtl_2  qtl_3  qtl_4  qtl_5  qtl_6  qtl_7  qtl_8  qtl_9 qtl_10
[1,]  0.324  0.325  0.324  0.328  0.324  0.325  0.324  0.322  0.325  0.324
[2,]  0.163  0.164  0.163  0.162  0.164  0.164  0.164  0.165  0.164  0.164
[3,] -0.138 -0.137 -0.137 -0.142 -0.137 -0.137 -0.137 -0.139 -0.136 -0.137
[4,]  0.152  0.151  0.152  0.149  0.151  0.151  0.151  0.149  0.152  0.151

For some reason, susie identified two CSs for SNP 4, and the single effect estimates are of opposite sign. I think this is the source of the convergence difficulty.

jerome-f commented 1 month ago

Hmm... that's potentially a bug, right? It violates the "single" assumption of the single-effect model?

pcarbo commented 1 month ago

It could be a bug — I will look into it.

stephens999 commented 1 month ago

It is not necessarily a bug - it doesn't violate the single effect assumption because we don't impose a constraint that the different single effects are not the same snp.

But this is unusual behavior, probably representing convergence to a poor local optimum, and could be worth examining closer.


pcarbo commented 1 month ago

I believe this is an optimization issue, rather than a "bug" — I suspect that if you run the algorithm for "long enough", it might eventually converge to a better solution (but that might take days!). The current algorithm has difficulty dealing with the combination of many strong effects that are also very strongly correlated with each other. This doesn't happen often, which is probably why we haven't noticed it before, but it makes sense now that I see it.

(And to be clear, the same issue arises with susie_rss.)

I tried with refine = TRUE, and it didn't fix the problem.

More concretely, what is happening, I believe, is that IBSS overestimates the size of the first effect, and then compensates later by introducing a negative effect. Once it does this, it gets "stuck" in this local optimum. All the SNPs are very strongly correlated, so it doesn't make sense that the first single effect (the first row here) is very different from the others:

> round(fitted$mu,digits = 3)
      qtl_1  qtl_2  qtl_3  qtl_4  qtl_5  qtl_6  qtl_7  qtl_8  qtl_9 qtl_10
[1,]  0.338  0.339  0.338  0.343  0.339  0.339  0.338  0.336  0.339  0.338
[2,]  0.161  0.162  0.161  0.160  0.162  0.162  0.162  0.163  0.162  0.162
[3,] -0.154 -0.152 -0.153 -0.158 -0.152 -0.153 -0.153 -0.155 -0.152 -0.153
[4,]  0.155  0.155  0.155  0.152  0.155  0.155  0.154  0.152  0.155  0.154

So this suggests that there is room for improvement in the optimization algorithm.

stephens999 commented 1 month ago

I agree it's an optimization issue. (I don't think it is weird for the first single effect to look very different from the subsequent single effects though; it would be weird if they looked the same, because the subsequent single effects are supposed to capture signals not captured in the first single effect.) It's a little surprising to me that refine=T did not help here.

The following sequential procedure seems to help (at least achieves a better optimum and avoids including the same variable in two different CSs):

fitted.1 <- susie(raw_gt_mat,pheno_mat,verbose = TRUE,L = 1)
fitted.2 <- susie(raw_gt_mat,pheno_mat,verbose = TRUE,s_init = fitted.1, L = 2)
fitted.3 <- susie(raw_gt_mat,pheno_mat,verbose = TRUE,s_init = fitted.2, L = 3)
fitted.4 <- susie(raw_gt_mat,pheno_mat,verbose = TRUE,s_init = fitted.3, L = 4)
summary(fitted.4)

I don't know if it will help in general, and it may increase computation...
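The same warm-start procedure can be written as a loop (a sketch; L_max is a hypothetical upper bound on the number of effects):

```r
# Sketch: grow L one effect at a time, warm-starting each fit from the
# previous one. L_max is a hypothetical upper bound; raw_gt_mat and
# pheno_mat are as defined earlier in this thread.
L_max <- 4
fitted <- susie(raw_gt_mat, pheno_mat, L = 1)
for (L in 2:L_max)
  fitted <- susie(raw_gt_mat, pheno_mat, s_init = fitted, L = L)
summary(fitted)
```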

pcarbo commented 1 month ago

This example illustrates the limitations of coordinate ascent.