privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Slope differences between standard deviations from LD-ref and from summary statistics #247

Closed (johannfaouzi closed this issue 3 years ago)

johannfaouzi commented 3 years ago

Hi,

I’m working with the summary statistics of a GWAS on Parkinson’s disease. In this meta-analysis, they included true cases (individuals with Parkinson’s disease), proxy-cases from UK BioBank (individuals without Parkinson’s disease but with a first-degree relative having Parkinson’s disease), and controls. The summary statistics provide, among others, the following columns: beta, beta_se, p, n_cases, n_controls.

When doing QC, it seems that the slope is slightly different for a few clusters of (sd_ldref, sd_ss) values. I checked the beta_se values provided in the summary statistics and they seem to match the theoretical approximation. I also plotted the numbers of cases and controls and there are a few clusters. Finally, I played with some fake values for the numbers of cases and controls and can get a better match by decreasing the number of cases. You can find all the plots attached.

All of this makes me think that the number of cases provided in the summary statistics is overestimated, because it includes both true cases and proxy-cases. Unfortunately, I have no way of determining the number of true cases for each SNP (but I have the total numbers of true cases and proxy-cases). I’m wondering what approach you would recommend to deal with this issue.

Thank you once again for your help, it’s very much appreciated!

sd QC with provided numbers of cases and controls

beta_se QC

Number of cases and controls

sd QC with fake numbers of cases and controls

privefl commented 3 years ago

Let's talk about your first figure.

privefl commented 3 years ago

In this particular scenario, I think I would split the variants into two groups based on the two groups of sample sizes, run LDpred2 on each group separately, and stack the results (e.g. a linear regression combining the two best models in the validation set).
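For the stacking step, a minimal sketch of what this could look like (pred1, pred2, y and new_pred1, new_pred2 are hypothetical names; the first two would be the per-individual polygenic scores from the two LDpred2 runs, and y the phenotype in the validation set):

# Hypothetical sketch: combine two polygenic scores by linear regression
# pred1: PGS from LDpred2 run on the first group of variants
# pred2: PGS from LDpred2 run on the second group of variants
# y: phenotype in the validation set
stack <- lm(y ~ pred1 + pred2)
summary(stack)

# Combined score for new individuals, using the fitted stacking weights
new_score <- predict(stack, data.frame(pred1 = new_pred1, pred2 = new_pred2))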

johannfaouzi commented 3 years ago

Thank you very much!

Indeed, I think the summary statistics come from a linear mixed-effects model:

This analysis used fixed-effects meta-analyses as implemented in METAL to combine summary statistics across all sources.

Moreover, given the first figure in the article, there seem to be 4 studies in the meta-analysis (IPDGC-NeuroX, UKBB, IPDGC and SGPD; the other ones are not included in the publicly available summary statistics):

[Screenshot from the article: overview of the studies included in the meta-analysis]

Unfortunately, the sample sizes are very unbalanced, so I'm not sure that I would be able to accurately split the variants. I should at least be able to determine the variants in the UKBB GWAS given its much larger sample size.

Moreover, this will not solve the QC filtering (sd_ss < (0.5 * sd_ldref) | sd_ss > (sd_ldref + 0.1)), because all the variants that are too far away from the identity line (y = x) will still be removed (which is 95% of the variants at the moment). So I'm still afraid that the proxy cases being counted as true cases is an issue (I would guess that this corresponds to the large cluster with the lowest slope in the first figure).
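For reference, a sketch of the check being discussed, following the formulas from the LDpred2 tutorial for binary traits (assuming sumstats has columns beta, beta_se and n_eff, and that sd_ldref is computed from the LD reference):

# sd of genotypes implied by the summary statistics (binary-trait approximation)
sd_ss <- with(sumstats, 2 / sqrt(n_eff * beta_se^2 + beta^2))

# The QC filter quoted above: flag variants whose sd_ss deviates too much
is_bad <- sd_ss < (0.5 * sd_ldref) | sd_ss > (sd_ldref + 0.1)
mean(is_bad)  # proportion of variants that would be removed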

privefl commented 3 years ago

The split could be either on sample size, at e.g. 10^5 controls, or on sd_ldref at 0.5 * sd_ss.

Are these two splits the same? How many variants are in each of the splits?
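A quick way to check this (a sketch, assuming sd_ldref and sd_ss are vectors aligned with the rows of sumstats):

# Two candidate splits of the variants
split_n  <- sumstats$n_controls > 1e5
split_sd <- sd_ldref > 0.5 * sd_ss
# Cross-tabulate to see whether they agree, and count variants in each group
table(split_n, split_sd)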

For the meta-analysis, are they also talking about some mixed-effects meta-analysis? (e.g. for the variants with some high heterogeneity)

johannfaouzi commented 3 years ago

The split could be either on sample size, at e.g. 10^5 controls, or on sd_ldref at 0.5 * sd_ss. Are these two splits the same? How many variants are in each of the splits?

They are almost the same splits:

For the meta-analysis, are they also talking about some mixed-effects meta-analysis? (e.g. for the variants with some high heterogeneity)

I looked briefly at the article and found this, I'm not sure if this answers your question:

We instituted two filters after fixed-effects and conditional and joint analyses, excluding variants that had a random-effects p value across all datasets more than 4·67 × 10⁻⁴ and a conditional analysis p value more than 4·67 × 10⁻⁴ using participant level 23andMe genotype data.

From the supplementary materials:

Conditional-joint analysis to nominate variants of interest

To nominate variants of interest, we employed a conditional and joint analysis strategy (GCTA-COJO, http://cnsgenomics.com/software/gcta/) as a means to algorithmically identify variants that best account for the heritable variation within and across nearby loci(22). This is particularly useful in scenarios where only basic summary statistics are available for a majority of samples in a meta-analysis and additional participant level analyses are logistically prohibitive. For this analysis, we used the full meta-analysis summary statistics in conjunction with the largest single site collection of HRC-level imputed PD and control data as a reference for linkage disequilibrium patterns in the conditional-joint workflow (described below and in the Methods Supplement).

Using the consensus IPDGC data cleaning and imputation workflow, we assembled a reference set of 17,188 cases and 22,875 controls at variants overlapping with the locus discovery analysis results that passed quality control on average in 74.3% of samples incorporating soft call genotypes at a minimum imputation quality of 0.30. This set includes data from all samples described in Supplementary Table S1, except the 23andMe post-Chang et al., SGPD, UK PDMED and UKB sample series. This aggregate dataset included previously described sample series such as the Dutch GWAS, German, UK and US IPDGC series plus the NeuroX-dbGaP and Myers-Faroud datasets from dbGaP(18,23,24). We assembled this large PD-specific LD reference to better ascertain LD patterns at PD loci, particularly the LRRK2 and GBA regions where rare risk variants are located.

The COJO analysis was run using default analysis parameters including a significance threshold of P < 5E-8 and a window specification of 1 megabase. Additional analyses described below were utilized to further scrutinize putative associated variants and account for possible differential linkage disequilibrium (LD) signatures in multiple ways, including utilizing the massive single site reference data from 23andMe in further conditional analyses. If a variant nominated during the COJO phase of analysis was greater than 1 megabase from any of the genome-wide significant loci nominated in Chang et al. 2017, we considered this to be a novel locus.

Additional filtering of nominated variants

We instituted two additional filters after fixed-effects and COJO analyses. These additional filters exclude variants that 1) had a random-effects P value across all datasets > 4.67E-04 and 2) a conditional analysis P > 4.67E-04 using participant level 23andMe genotype data. This Bonferroni multiple testing threshold is based on up to 107 nominated variants at this stage of filtering, of which 90 passed these criteria. Random-effects meta-analysis P values were generated under the residual maximum likelihood method using the R package metafor(25). Forest plots for all loci of interest are available in the Supplemental Appendix.

Conditional analyses were carried out using 23andMe pooled data analyses including all available 23andMe data (from Nalls et al. 2014, PDWBS and the post-Chang et al. 2017 datasets combined). For the participant level conditional analyses in 23andMe, all nominated variants per chromosome were included in a single logistic regression model with appropriate covariates, then parameter estimates per variant were extracted. Conditional analyses on a per chromosome interval instead of a locus or megabase interval should adjust for possible longer range LD associations. For more information on variant filtering, please see Supplementary Table S2 summarizing all variants nominated. We defined nominated risk variants as sharing a single locus if they are within +/- 250kb of each other.

I attached the article and the supplementary materials if you want to have a look, but you have already done a lot, so do not feel obliged to.

Nalls et al. - 2019 - Identification of novel risk loci, causal insights.pdf

1-s2.0-S1474442219303205-mmc1-2.pdf

privefl commented 3 years ago

So, most variants have n_controls > 1e5? Then just remove the other ones.

Then, I think it is a matter of properly estimating n_eff as sum(n_eff) instead of computing n_eff with sum(n_controls) and sum(n_cases). Then look at the QC plot again to see whether it aligns well.

johannfaouzi commented 3 years ago

Thank you! Yes, most variants have a large number of controls (meaning that UKBB is included). I removed the other variants.

I did not understand what you meant by:

properly estimating n_eff as sum(n_eff) instead of computing n_eff with sum(n_controls) and sum(n_cases)

Do you think that I should use a single value of n_eff for all the variants instead of one value per variant? I looked at the distribution of (n_cases, n_controls) pairs; there are 115 unique ones. Here is the table of the most common pairs (they account for 99.5941% of the values):

n_cases n_controls frequency
26421 442271 536611
33674 449056 357925
27823 443190 30889
27460 443025 19996
32505 448088 19183
25252 441303 17835
26948 442743 7337
26933 442553 3458
27311 442908 2958
26784 442436 2556
32272 448137 1988
32635 448302 1745
33311 448891 1586
33147 448584 1444
27296 442718 1363
26654 442222 1081
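(For reference, a table like this can be obtained with something like the following, assuming the columns are named n_cases and n_controls:)

library(dplyr)
# Count occurrences of each (n_cases, n_controls) pair, most frequent first
count(sumstats, n_cases, n_controls, sort = TRUE)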

Do you think that I should fix the number of controls and "fit" the number of cases so that the alignment is good on the QC plot? Here is the QC plot with the provided values of n_controls and n_cases and with the QC criteria (the alignment is obviously not great).

sd QC with provided numbers of cases and controls

privefl commented 3 years ago

It seems you have 7 studies that are meta-analyzed.

To compute n_eff = 4 / (1 / n_ca + 1 / n_co) for a meta-analysis, I think you should compute n_eff_meta = sum_k(4 / (1 / n_ca_k + 1 / n_co_k)) instead of 4 / (1 / sum_k(n_ca_k) + 1 / sum_k(n_co_k)).

The numbers you show seem to contain fewer samples than the total of the seven studies, which seems weird. Unless maybe you are not including UKBB and 23andMe. This would give something like this:

Nca <- c(13708, 5851, 6476, 1169, 8039)
sum(Nca)  # 35,243
Nco <- c(95282, 5866, 302402, 968, 5803)
sum(Nco)  # 410,321

# what you're probably using
4 / (1 / sum(Nca) + 1 / sum(Nco))  # 129821.5
# what you should use
sum(4 / (1 / Nca + 1 / Nco))  # 100612.4

But a 1.3 ratio may still not be enough of a correction (especially when taking the sqrt).
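To make the sqrt point concrete:

# sd_ss scales as 1 / sqrt(n_eff) for fixed standard errors,
# so a ~1.29 ratio on n_eff only shifts sd_ss by a factor of ~1.14
sqrt(129821.5 / 100612.4)  # 1.136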

johannfaouzi commented 3 years ago

Due to data sharing restrictions, there are only four cohorts in the publicly available summary statistics:

So, with these values:

n_cases <- c(
    5851,    # IPDGC-NeuroX
    18618,   # UK BioBank
    1169,    # SGPD
    8036     # IPDGC
)

n_controls <- c(
    5866,    # IPDGC-NeuroX
    436419,  # UK BioBank
    968,     # SGPD
    5803     # IPDGC
)

4 / (1 / sum(n_cases) + 1 / sum(n_controls))  # 125299.954376152
sum(4 / (1 / n_cases + 1 / n_controls))       # 98738.7200380937

The resulting QC plot is a bit better, but still not enough (by the way, I edited the previous figure because it was wrong):

QC with right formula

But proxy cases are not true cases. From a quick Google search, I found that genetics causes about 10% to 15% of all Parkinson's disease cases (I will do a better literature search later). So, if I replace the number of cases in the UKBB with 2,500:

n_cases <- c(
    5851,    # IPDGC-NeuroX
    2500,    # UK BioBank (rough estimate of the number of true cases)
    1169,    # SGPD
    8036     # IPDGC
)

n_controls <- c(
    5866,    # IPDGC-NeuroX
    436419,  # UK BioBank
    968,     # SGPD
    5803     # IPDGC
)

4 / (1 / sum(n_cases) + 1 / sum(n_controls))  # 67581.8636125946
sum(4 / (1 / n_cases + 1 / n_controls))       # 37256.8101737556

and the resulting QC plot with the right formula is much better:

QC with right formula and estimated number of true cases

Do you think that it is a sound methodology?

Thank you again very much for your help!

privefl commented 3 years ago

The orange/green filters seem fine. Or even red instead of orange.

johannfaouzi commented 3 years ago

Yes, orange is not a filter; it was just a reference!

Thank you so much again for your help!