Unaligned SNPs and sumstat loading in LAVA

SBNeuro1 commented 5 months ago

Hello,

Thanks very much for producing this tool. It’s fantastic to have a method to check for genetic correlations across phenotypes. I have a few questions, but first a few details about my analysis:

Number of phenotypes: 15
Sumstat genome version: hg19
Ancestry: European
Reference data set: g1000_eur
Number of loci analysed: 2495 (genome-wide)

I made a list of all the genome-wide significant (GWS) SNPs reported in the studies I got the sumstats from, giving a total of 1312 GWS SNPs across my 15 phenotypes of interest. I then identified the rsIDs that were dropped when loading the sumstats in LAVA (stored in input$unalignable.snps) and cross-checked whether any of the GWS SNPs were among them. A total of 114 GWS SNPs were removed from the analysis, which is 8.7% of all GWS SNPs in my phenotypes of interest. I checked the chromosomal locations of these SNPs and they aren’t concentrated anywhere in particular. Is this normal, or do I have some issue with reading the sumstats and aligning them to the reference data set?

I’ve also noticed an issue where some loci that contain aligned GWS SNPs for a condition do not give a significant univariate result. Is this normal/common and why might this be occurring?

I also found for one sumstat that almost every single locus had a significant univariate result (P < 2.00e-5), and many of these results had a P-value = 0. I have confirmed that this is European data, and I have also confirmed that the effect allele and reference allele columns have been labelled correctly. What might be going wrong with this sumstat?

Thanks very much

cadeleeuw commented 4 months ago

Hi,

With regard to the unaligned SNPs, the main reason this happens is that they have strand-ambiguous alleles (i.e. A/T or C/G), which at present LAVA cannot align to the reference data, since the correct orientation cannot be determined from the allele coding itself (though we are looking into other solutions for this). Generally, around 10-15% of SNPs will fall into this category, so that 8.7% would be consistent with this.
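
To check how many of the dropped SNPs are explained by this, one option is to flag A/T and C/G SNPs directly from the allele columns of the sumstats. The following is a minimal Python sketch, not LAVA's own code; the function name and the example rsIDs are hypothetical.

```python
# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """Return True for A/T and C/G SNPs.

    For these pairs, flipping the strand yields the same two alleles,
    so the allele coding alone cannot tell which strand was used.
    """
    a1, a2 = a1.upper(), a2.upper()
    return COMPLEMENT.get(a1) == a2


# Hypothetical example SNPs: (rsID, allele 1, allele 2).
snps = [("rs1", "A", "T"), ("rs2", "A", "G"), ("rs3", "C", "G"), ("rs4", "T", "C")]
ambiguous = [rsid for rsid, a1, a2 in snps if is_strand_ambiguous(a1, a2)]
print(ambiguous)  # ['rs1', 'rs3']
```

If most of the rsIDs in input$unalignable.snps are flagged this way, the drop rate is likely explained by ambiguous alleles rather than by a loading problem.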

It is indeed possible for a block that contains a significant SNP to not be significant on the univariate test, of course also depending on the threshold you use for that. The univariate test is a joint test of significance of all the SNPs in the block, which will typically number several thousand, and so it is possible that the signal of that significant SNP is drowned out by the other SNPs if most of those don't show any association. Generally we wouldn't expect this to happen too often (especially the signal disappearing completely, i.e. univariate p > 0.05), but occasionally it can. This tends to be more likely with associations in low MAF SNPs, since those tend not to show as much LD with other SNPs in the region.

As for that phenotype with almost universally significant univariate tests, it does indeed sound like something is very off there. If there are no obvious signs of alignment errors or LD mismatch, though, it's hard to say what might be wrong without seeing the actual sumstats. If these are public sumstats and you point me to where to find them, I can take a closer look to see if I can determine the problem.

Best, Christiaan

SBNeuro1 commented 4 months ago

Hi Christiaan,

Thanks very much for your reply, it has been very helpful! The ambiguous alleles explanation makes sense, so no worries about that anymore. Glad to hear that it is normal for some regions containing individual GWS SNPs to not show univariate significance. I had a look at some of the effect allele frequencies in my sumstats over the previous few days, and some of them do contain effect alleles that are rare. I might go back and check whether the regions that contain GWS SNPs have P < 0.05, because I have been using the genome-wide Bonferroni threshold (0.05/2495) to interpret the results so far.
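
For reference, the two thresholds being compared here can be worked out as below. This is only an arithmetic sketch in Python; the p-value in the last lines is a made-up example, not a result from this analysis.

```python
n_loci = 2495   # number of loci analysed (from the first post)
alpha = 0.05    # nominal significance level

# Bonferroni correction: divide alpha by the number of tests.
bonferroni = alpha / n_loci
print(f"{bonferroni:.2e}")  # 2.00e-05

# Hypothetical univariate p-value for a locus containing a GWS SNP.
p_univ = 3.0e-4
print(p_univ < bonferroni)  # False: not significant after correction
print(p_univ < alpha)       # True: still nominally significant
```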

As for the problematic sumstat, it can be found at this location: https://conservancy.umn.edu/items/ca7ed549-636b-41c0-ae79-97c57e266417. The specific sumstat I want to use is DrinkPerWeek.txt.gz. I still haven’t been able to resolve the issue with this one, so any advice would be greatly appreciated.

Thanks again!

cadeleeuw commented 4 months ago

Hi,

I have looked at the summary statistics, and found the issue: unusually, the STAT column in this file contains chi-square statistics, rather than the t-statistics / z-statistics that are normally there (and that LAVA expects). By default, LAVA will use the STAT column in an input file and will interpret it as a z-statistic, so of course if that's actually a chi-square then that will go wrong.
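
A quick way to spot this kind of problem before loading a file is to compare the signs of the statistic and the effect size: a z- or t-statistic is negative whenever the beta is, whereas a chi-square statistic is never negative. The sketch below is a rough heuristic in Python under that assumption, not something LAVA does; the function name and the example values are hypothetical.

```python
def stat_looks_unsigned(pairs):
    """Heuristic check on (beta, stat) pairs from a sumstats file.

    Returns True if some betas are negative but no statistic is,
    which suggests the column holds chi-square values rather than
    signed z- or t-statistics.
    """
    pairs = list(pairs)
    any_negative_beta = any(beta < 0 for beta, _ in pairs)
    any_negative_stat = any(stat < 0 for _, stat in pairs)
    return any_negative_beta and not any_negative_stat


# Hypothetical rows: the same three SNPs, once with z-statistics
# and once with the corresponding chi-square values (z squared).
as_z = [(-0.02, -2.1), (0.01, 1.3), (-0.03, -3.0)]
as_chisq = [(-0.02, 4.41), (0.01, 1.69), (-0.03, 9.0)]

print(stat_looks_unsigned(as_z))      # False
print(stat_looks_unsigned(as_chisq))  # True
```

On a real file with thousands of SNPs, a complete absence of negative values in the statistic column is a strong hint that it is not a z-statistic.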

I will make a note to add some functionality to LAVA to detect this, to prevent this issue in future. For now, though, you can address it in your analyses by removing the STAT column, or just renaming it to e.g. CHI_STAT. This will cause LAVA to use the beta and p-value columns instead (the same applies to any other sumstats from this study you may want to use). I would recommend double-checking any other secondary analyses you performed using these sumstats, in case those also used the STAT column and interpreted it incorrectly.
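
The renaming can be done without loading the whole file into memory by rewriting only the header line. The following Python sketch assumes the file is gzip-compressed text with a single header row containing a column named exactly STAT; the function name and the output file name are hypothetical.

```python
import gzip
import re
import shutil


def rename_header_column(src, dst, old="STAT", new="CHI_STAT"):
    """Copy a gzipped sumstats file, renaming one column in the header.

    Only a whole header token equal to `old` is replaced, so the
    original delimiter (tabs or spaces) is left untouched.
    """
    # Match `old` only when bounded by whitespace or the line edges.
    pattern = re.compile(r"(?<!\S)" + re.escape(old) + r"(?!\S)")
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        header = fin.readline()
        new_header, n_hits = pattern.subn(new, header)
        if n_hits != 1:
            raise ValueError(f"expected exactly one {old!r} column, found {n_hits}")
        fout.write(new_header)
        # Stream the remaining data rows through unchanged.
        shutil.copyfileobj(fin, fout)
```

It could then be called as `rename_header_column("DrinkPerWeek.txt.gz", "DrinkPerWeek.renamed.txt.gz")`, with the new file passed to LAVA in place of the original.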

Best, Christiaan

SBNeuro1 commented 4 months ago

Hi Christiaan,

Thanks very much for checking this out! I removed the STAT column in the sumstat and the P = 0 issue was immediately resolved.

josefin-werme / LAVA, issue #75