Difference of SuSiE performance between WGS genotype and imputed genotype

RL-m commented 1 year ago

Hello SuSiE developers,

I used SuSiE to run simulations and found that it performs differently on WGS genotypes and imputed genotypes. I designed two sets of simulations: 1) I used real WGS data as the genotypes and simulated phenotypes in a similar way to that described in Section 4 of the SuSiE paper. 2) I selected a subset of SNPs from the WGS data (those on the UKB Axiom array), imputed them to the HRC reference panel, and used the result as the genotypes (as described in this paper). I then simulated phenotypes in the same way as in 1). In both scenarios I set a single effect variable and varied the variance explained from 0.004 to 0.8, with a sample size of 8853. However, at larger variance explained (>0.016), the recall of SuSiE decreased in scenario 2), while the recall in scenario 1) was as expected. I also ran another fine-mapping tool, FINEMAP, and its results were very similar to SuSiE's.
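
A minimal sketch of simulating a phenotype with one causal SNP and a target variance explained, assuming the residual variance is chosen so that Var(xb) / (Var(xb) + sigma^2) equals the target. The function name `simulate_phenotype`, the allele frequency of 0.2, and the simulated genotypes are illustrative assumptions, not the setup used in the thread (which used real WGS data).

```python
import random
import statistics

def simulate_phenotype(x, pve, rng):
    """Simulate y = x*b + e so that x*b explains a fraction `pve` of Var(y).

    x   : genotypes (0/1/2) or dosages of the single causal SNP
    pve : target proportion of variance explained, 0 < pve < 1
    rng : random.Random instance, for reproducibility
    """
    var_x = statistics.pvariance(x)
    b = 1.0  # effect size fixed at 1; only the signal-to-noise ratio matters
    # Residual variance that gives the requested PVE in expectation.
    sigma2 = b * b * var_x * (1.0 - pve) / pve
    sd = sigma2 ** 0.5
    return [b * xi + rng.gauss(0.0, sd) for xi in x]

rng = random.Random(1)
n, maf = 8853, 0.2  # n matches the thread; maf is a hypothetical value
# Hypothetical genotypes: two independent allele draws per individual.
x = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
y = simulate_phenotype(x, 0.016, rng)

realized_pve = statistics.pvariance(x) / statistics.pvariance(y)
print(f"realized PVE: {realized_pve:.4f}")
```

The realized value fluctuates around the target because the noise is sampled; it is not forced to match exactly.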

I couldn't think of a proper explanation for this finding. I checked the simulations in your paper, and it seems that you used the same kind of genotypes as in scenario 1). Do you have any idea why imputed genotypes would cause these results?

gaow commented 1 year ago

@RL-m Could you clarify how you defined "recall"? Is it whether a signal is captured by a 95% CS? Also, in our paper we focused on variants with MAF > 5%, whereas Wu et al., the other paper you cited, included rare variants. Are the majority of variants also rare in your wgs-imp simulation?

RL-m commented 1 year ago

Sorry, I left out that information. I defined "recall" as the "power" in the SuSiE paper, i.e. the proportion of causal variants captured by a 95% CS, and I only used common variants (MAF > 1%) in my simulation.
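
The recall definition above can be sketched as follows. This is an illustrative helper, not a susieR function: the name `recall` and the input layout (one list of credible sets and one set of causal indices per simulation) are assumptions.

```python
def recall(credible_sets, causal):
    """Proportion of causal variants captured by at least one credible set.

    credible_sets : per simulation, a list of CSs (each a set of SNP indices)
    causal        : per simulation, the set of true causal SNP indices
    """
    captured = 0
    total = 0
    for css, truth in zip(credible_sets, causal):
        covered = set().union(*css)  # all SNPs in any CS; empty if no CS reported
        captured += len(truth & covered)
        total += len(truth)
    return captured / total

# Three toy simulations with one causal SNP each:
# captured, no CS reported, and a CS that misses the causal SNP.
toy_cs = [[{3, 4}], [], [{10}]]
toy_causal = [{3}, {7}, {11}]
print(recall(toy_cs, toy_causal))
```

Under this definition a simulation with no reported CS counts as a miss, so recall drops both when CSs are absent and when they exclude the causal SNP.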

pcarbo commented 1 year ago

It is reassuring that susie and finemap show similar trends for the imputed SNPs.

Yes, as Gao said, in your results you should try breaking down the precision/recall by SNP allele frequency.

stephens999 commented 1 year ago

What is perhaps most weird is that recall decreases with h^2 for the imputed simulations. That seems very wrong. It may not be hard to diagnose the problem, because from the non-imputed results the recall should be almost 100% for h^2 that large. I would suggest looking in detail at what is going on for 1-2 of the simulated datasets with large h^2 and imputed genotypes. For example, is the true causal SNP the one with the largest or near-largest z score?

Matthew

pcarbo commented 1 year ago

@RL-m Are you running susie or susie_rss?

RL-m commented 1 year ago

I ran susie_rss with in-sample LD.

RL-m commented 1 year ago

@stephens999 I checked whether the causal SNP is the top SNP (the one with the largest Chi^2) in each simulation. Here is the result. The y-axis is the proportion of simulations in which the causal SNP has the largest Chi^2. For WGS genotypes, almost all the causal SNPs are the "top SNP" at the largest h^2, while for imputed genotypes, 70% of causal SNPs are the "top SNP". For h^2 less than 0.016 (non-centrality parameter less than 142), the difference is not very obvious.
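
A small sketch of the two quantities in this comment. The non-centrality parameter is taken here as n * h^2, an approximation that reproduces the quoted value of 142 at h^2 = 0.016 and n = 8853. The helper names and the toy z-scores are illustrative assumptions.

```python
N = 8853  # sample size from the thread

def ncp(h2, n=N):
    """Approximate non-centrality parameter of the causal SNP's chi-square."""
    return n * h2

def top_snp_rate(z_by_sim, causal_by_sim):
    """Proportion of simulations where the causal SNP has the largest chi-square.

    z_by_sim      : per simulation, a list of marginal z-scores
    causal_by_sim : per simulation, the index of the causal SNP
    Ties with the maximum count as hits.
    """
    hits = 0
    for z, c in zip(z_by_sim, causal_by_sim):
        chi2 = [zi * zi for zi in z]
        hits += chi2[c] == max(chi2)
    return hits / len(z_by_sim)

print(round(ncp(0.016)))

# Toy example: the causal SNP is on top in the first simulation only.
toy_z = [[1.0, 5.0, -2.0], [0.5, -6.0, 3.0]]
toy_causal = [1, 2]
print(top_snp_rate(toy_z, toy_causal))
```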

RL-m commented 1 year ago

To add to my question: I think it's expected that the z-score of the causal SNP in the imputed genotypes differs greatly from that in the WGS genotypes at large h^2 (imputation error may be one reason). What I don't understand is the decrease in fine-mapping recall at large h^2 in the imputed simulations. Why doesn't fine-mapping power follow the same trend as GWAS power?

stephens999 commented 1 year ago

Are you simulating data with real genotypes and then analyzing it with imputed genotypes?

That is, simulating Y = X_real b + E and analyzing using Y = X_impute b + E,

where X_impute ≈ X_real but not equal?

RL-m commented 1 year ago

@stephens999 Yes, that is my simulation design. I used Y = X_real b + E to mimic the real phenotype, and Y = X_imputed b + E to mimic the array-imputed genotypes employed in common GWAS analyses.
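
A toy sketch of this design, for inspecting how the causal SNP's z-score changes when the analysis genotype is only an approximation of the one that generated the phenotype. Everything here is an assumption for illustration: the genotypes are simulated, "imputation" is mimicked by adding Gaussian noise to the true genotype (real imputation error does not look like this), and the values of `maf`, `h2`, and the noise level are arbitrary.

```python
import random
import statistics

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def z_score(x, y):
    """Marginal z-score (t-statistic) from simple linear regression of y on x."""
    r = corr(x, y)
    return r * ((len(x) - 2) / (1.0 - r * r)) ** 0.5

rng = random.Random(0)
n, maf, h2 = 8853, 0.3, 0.3  # n from the thread; maf and h2 are hypothetical

# "Real" genotype of the causal SNP, and a noisy stand-in for its imputed dosage.
x_real = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
x_imp = [xi + rng.gauss(0.0, 0.3) for xi in x_real]

# Simulate with the real genotype: Y = X_real b + E, with b = 1.
sd_e = (statistics.pvariance(x_real) * (1.0 - h2) / h2) ** 0.5
y = [xi + rng.gauss(0.0, sd_e) for xi in x_real]

# Analyze with each version of the genotype.
z_real = z_score(x_real, y)
z_imp = z_score(x_imp, y)
print(f"corr(x_real, x_imp) = {corr(x_real, x_imp):.3f}")
print(f"z with real genotype = {z_real:.1f}, z with noisy genotype = {z_imp:.1f}")
```

The association seen at the noisy genotype is attenuated roughly in proportion to its correlation with the true genotype. Extending the sketch with neighboring SNPs in LD would allow a direct check of whether a well-measured neighbor can overtake the causal SNP, as suggested earlier in the thread.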

stephenslab / susieR

Difference of SuSiE performance between WGS genotype and imputed genotype #174