stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
169 stars 42 forks source link

kriging_rss plot looks very different despite same processing #207

Open nickhir opened 7 months ago

nickhir commented 7 months ago

Hi all, I have downloaded GWAS data from two traits for one region that I would like to finemap. Because we do not have access to the individual level genotype data, I am using the 1000g reference panel to calculate LD between my SNP markers. I am using a function to align/harmonize my alleles (and betas) so that they match the Ref/Alt allele from the 1000g panel. To make sure that this works correctly, I am using the kriging_rss function provided by susieR. For the first trait, this looks good, and it seems like there are no allele switch issues except for one SNP (see plot below, trait1). However, for the second trait, which was processed with exactly the same pipeline, the plot looks very "off" (see plot below, trait2). kriging_rss

I am particularly surprised that I am not observing any "expected values" > |3| (which was the case for trait 1). Also, despite obvisouly looking worse, trait2 does not contain any "red" points which I thought would indicate outliers. I have attached a picture of the LD matrix and distribution of Zscores for the SNPs below, and just by visual inspection they do not seem extremely different. Also the sample size is very similar, so I am confused why I am observing such different plots. LD_matrix zscores

Furthermore, I noticed that I get the following warning when running kriging_rss:

WARNING: The matrix R is not positive semidefinite. Negative eigenvalues are set to zero.
WARNING: The matrix R is not positive semidefinite. Negative eigenvalues are set to zero

This surprised me a little, because I am generating my LD matrix using plink --r, which I thought was the intended way to create the signed LD matrix.

I would very much appreciate any insights! Cheers!

pcarbo commented 7 months ago

If the GWAS summary statistics are from the same source, then I agree these results are surprising.

Could you please explain what are the differences between the two LD matrices? Are they for different SNPs? In principle one should be able to do use the same LD matrix.

Also please see the PLoS Genetics paper about interpreting the results; a point far away suggests a genotyping inconsistency, not necessarily an allele flip.

nickhir commented 7 months ago

Thank you very much for your quick answer! I realized that I made a mistake when converting Z scores to betas and standard errors for the second dataset. Essentially, for some SNPs I was using the "wrong" MAF. Once I fixed this, I got a very good looking kriging_rss$plot, that did not show any obvious outliers. However, now when I run susie, I get 10 credible sets, which seems quite a lot for a region which is only 250kb and only contains ~1k SNPs. Below is a locus zoom plot where I have highlighted the identified CS. Do you think that 10 credible sets are reasonable given the comparetively low LD this region shows or do you think something might have gone wrong? I am particularly concerend with credible set 5, which is made up of only 1 SNP that has a p value much greater than 10^-8. I read in the PLoS Genetic paper, that outliers in the kriging_rss$plot might lead to an unexpectedly high number of credible sets, but I am quite certain that I have removed all of them.

LZ

I very much appreciate any insights or pointers!

pcarbo commented 7 months ago

If you have many SNPs with very small p-values that are not in strong LD with each other, then this is expected. (And in fact there may be more than 10 CSs — try setting L to a larger value.)

CS 5 does seem a little odd — I would check it carefully. Perhaps running with refine = TRUE will improve the result.

nickhir commented 7 months ago

Thank you for your answer! I already tried including refine=TRUE, but this did not change the results. Can you elaborate what you mean by "carefully checking" CS 5? Because other than checking the genomic features that this SNP overlaps with, I dont have any more ideas..

pcarbo commented 7 months ago

It may depend on the data that are available to you, but I think the larger point is that because you are working with summary statistics from a separate study, and you are using LD estimates that approximate the "true" LD, there are several possible sources of error in your fine-mapping analyses, so one should treat all fine-mapping results with some level of caution. (This is a caveat that applies to many/most fine-mapping analyses that use summary data.)

koujiaodahan commented 6 months ago

Thank you very much for your quick answer! I realized that I made a mistake when converting Z scores to betas and standard errors for the second dataset. Essentially, for some SNPs I was using the "wrong" MAF. Once I fixed this, I got a very good looking kriging_rss$plot, that did not show any obvious outliers. However, now when I run susie, I get 10 credible sets, which seems quite a lot for a region which is only 250kb and only contains ~1k SNPs. Below is a locus zoom plot where I have highlighted the identified CS. Do you think that 10 credible sets are reasonable given the comparetively low LD this region shows or do you think something might have gone wrong? I am particularly concerend with credible set 5, which is made up of only 1 SNP that has a p value much greater than 10^-8. I read in the PLoS Genetic paper, that outliers in the kriging_rss$plot might lead to an unexpectedly high number of credible sets, but I am quite certain that I have removed all of them.

LZ

I very much appreciate any insights or pointers!

I met a bad consistency too, could you tell me what the "wrong maf" means?

nickhir commented 6 months ago

The dataset I used provided a seperate file with MAF for each of the SNPs it contained. However, as part of my preprocessing, I aligned all my alleles according to the 1000g VCF file. This meant that for some of the SNPs REF and ALT alleles got flipped. I then made the mistake of using the MAF file which was provided by the author (and corresponded to the "unflipped" dataset). Essentially, this caused the MAF to be wrong for all the alleles that got flipped during my preprocessing (if allele gets flipped the correct MAF would be "1-reported_MAF"). I hope that makes sense!