precimed / mixer

Causal Mixture Model for GWAS summary statistics
GNU General Public License v3.0

Recommended fitting procedure #14

Closed gnarw closed 4 years ago

gnarw commented 4 years ago

Looking at the MiXeR --fit-sequence options, I have been wondering which fitting options to choose. Here is a list of three questions; it would be helpful if you could help me understand these options.

  1. What options are recommended for the univariate and bivariate analyses?
  2. Is it recommended to first fit the model with the --extract option and then subsequently run it using all SNPs?
  3. In the README file there is an example of univariate and bivariate analyses. The first bivariate analysis imports the results from the first set of univariate analyses. The results from the second set of univariate analyses (using all SNPs) are not loaded into the bivariate analysis. If the bivariate analysis is to be run on all SNPs, would it not be better to also load the univariate results for all SNPs?
ofrei commented 4 years ago

To your first question: by default MiXeR uses --fit-sequence diffevo-fast neldermead in the univariate analysis and --fit-sequence diffevo-fast neldermead-fast brute1 brent1 in the bivariate analysis. These are the recommended parameters, and you don't need to specify them explicitly.

The meaning (sorry for the perhaps too technical explanation...) is as follows:

ofrei commented 4 years ago

For fitting the model, the recommendation is to constrain the fit to HapMap3 SNPs, i.e. use --extract, in line with the general practice established by LD score regression. But I do realize this is quite controversial, and one item on my to-do list is to evaluate how the parameter estimates differ depending on the reference panel (1kG, HRC, UKB), with and without constraining to HapMap3 SNPs.

For now, the only reason to skip --extract and use all SNPs is when you would like to make QQ plots; I don't see a reason to constrain QQ plots to HapMap3 SNPs.

The same applies to the bivariate analysis. The recommendation is to constrain the bivariate model fit to HapMap3 SNPs; therefore it's reasonable to load the univariate results fitted on HapMap3 SNPs.

gnarw commented 4 years ago

Thanks for these answers. Very helpful.

gnarw commented 4 years ago

I have been running MiXeR on the SCZ and EDU data sets (PGC_SCZ_2014_EUR_qc_noMHC.csv.gz and SSGAC_EDU_2018_no23andMe_noMHC.csv.gz) with the 1000G_EUR_Phase3.. reference files, as used in the README.md.

  1. For the univariate analysis I'm using the following command: mixer.py fit --trait1-file (SCZ file) --out (path to output.fit) --ci-alpha 0.05 --extract w_hm3.justrs --bim-file (path.bim) --frq (path.frq) --plink-ld-bin0 path.p05_SNPwind50k.ld.bin --lib /home/gunnarr/mixer/src/build/lib/libbgmg.so
  2. Then the same command for the (EDU file).
  3. For the bivariate analysis I'm using the following: mixer.py fit --trait1-file (SCZ file) --trait2-file (EDU file) --trait1-params-file (output from SCZ univariate analysis.fit.json) --trait2-params-file (output from EDU univariate analysis.fit.json) --out (path/trait1_trait2.fit) --ci-alpha 0.05 --extract w_hm3.justrs --bim-file (path.bim) --frq (path.frq) --plink-ld-bin0 path.p05.SNPwind50k.ld.bin --lib /home/gunnarr/mixer/src/build/lib/libbgmg.so

I ran these analyses 4 times. The resulting Venn diagrams (SCZ_EDU.png) are attached; the results are fairly consistent.

I then analysed another pair of GWASs; let's call them trait1 and trait2. The sample size for trait1 is 312,320 and for trait2 it is 449,618. I used the same reference panel and the same MiXeR commands as for the SCZ_EDU analysis. I have attached the resulting Venn diagrams from 4 runs (trait1_trait2.png); the results are very inconsistent. I'm not sure what the cause could be or what to look for. Any suggestions?

[Attached Venn diagrams: SCZ_EDU.png and trait1_trait2.png]

Thanks,

ofrei commented 4 years ago

@gnarw The 3rd run has a much larger polygenicity estimate for trait2. I've seen similar behavior for very low-heritability traits, with nearly flat QQ plots.

Could you please share the .log and .json files, from both the univariate and bivariate analyses? To avoid sharing them on GitHub you may send them to my e-mail, oleksandr.frei at gmail.com.

ofrei commented 4 years ago

@gnarw Thanks for sharing the data. In this case the problem is related to low heritability (both traits have SNP h2 around 0.02). Calculating BIC shows that BIC(infinitesimal model) is lower than BIC(causal mixture model), which is a problem, as mentioned in the Discussion section of the MiXeR paper:

... MiXeR applies Bayesian information criterion (BIC) to compare causal mixture model versus the infinitesimal model, as shown in Supplementary Table 9. The cases where BIC selects the infinitesimal model indicate that the GWAS sample size is insufficient to reliably fit the polygenicity parameter.

In your case this didn't break convergence for trait1, presumably because it has slightly lower polygenicity (and is therefore easier to fit). The argument is a bit circular, but I see that all four runs gave consistent polygenicity estimates for trait1, and it seems to converge to 0.0021. For the second trait, the two runs with consistent answers gave 0.006 and 0.007, so it is not unlikely that the pi parameter is about 3 times higher than in the first trait.
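
For reference, the BIC values in the snippet below follow the standard definition (here cost is the negative log-likelihood at the optimum, cost_df the number of free model parameters, and cost_n the number of data points entering the cost):

BIC = cost_df * ln(cost_n) + 2 * cost

Lower is better, so a negative difference BIC(infinitesimal) - BIC(causal mixture) means the data do not support the extra polygenicity parameter of the mixture model.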

import json
import numpy as np

def calc_BIC(params):  # lower is better
    return np.log(params['cost_n']) * params['cost_df'] + 2 * params['cost']

def calc_AIC(params):  # lower is better
    return np.log(params['cost_n']) * 2 + 2 * params['cost']

for trait in ['trait1', 'trait2']:
    for run in ['run1', 'run2', 'run3', 'run4']:
        with open('{}.fit - {}.json'.format(trait, run)) as f:
            data = json.load(f)
        h2 = data['ci']['h2']['point_estimate']
        pi = data['ci']['pi']['point_estimate']
        # differences are "infinitesimal model minus causal mixture model",
        # so positive values favour the causal mixture model
        aic = calc_AIC(data['inft_optimize'][-1][1]) - calc_AIC(data['optimize'][-1][1])
        bic = calc_BIC(data['inft_optimize'][-1][1]) - calc_BIC(data['optimize'][-1][1])
        print('{} {} h2={:.3f} pi={:.3f} AIC={:.3} BIC={:.3}'.format(trait, run, h2, pi, aic, bic))

Result (a negative BIC difference indicates that the causal mixture model doesn't yield a statistically significant improvement over the infinitesimal model):

trait1 run1 h2=0.020 pi=0.001 AIC=7.81 BIC=-3.88
trait1 run2 h2=0.021 pi=0.001 AIC=10.9 BIC=-0.829
trait1 run3 h2=0.022 pi=0.002 AIC=11.4 BIC=-0.266
trait1 run4 h2=0.021 pi=0.002 AIC=10.8 BIC=-0.876
trait2 run1 h2=0.025 pi=0.006 AIC=4.03 BIC=-7.47
trait2 run2 h2=0.026 pi=0.007 AIC=3.87 BIC=-7.62
trait2 run3 h2=0.025 pi=0.070 AIC=2.0 BIC=-9.5
trait2 run4 h2=0.023 pi=0.003 AIC=-5.68 BIC=-17.2

Hope this answers your question. I'll keep the ticket open to integrate AIC/BIC into the precimed/mixer_figures.py scripts - it should be clear from the MiXeR output that low heritability and (presumably) high polygenicity of a trait result in unreliable polygenicity estimates in the univariate analysis, and therefore the bivariate analysis will also be unstable.
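
Something along these lines could work (just a sketch - the function name and the printed message are not final, and it reuses the same .json fields and calc_BIC as the snippet above):

import json
import numpy as np

def calc_BIC(params):  # lower is better
    return np.log(params['cost_n']) * params['cost_df'] + 2 * params['cost']

def check_polygenicity_reliability(fname):
    # sketch of a helper for mixer_figures.py: warn when BIC prefers the
    # infinitesimal model, i.e. when the polygenicity estimate in this
    # .fit.json file is likely to be unreliable
    with open(fname) as f:
        data = json.load(f)
    bic_diff = calc_BIC(data['inft_optimize'][-1][1]) - calc_BIC(data['optimize'][-1][1])
    if bic_diff < 0:
        print('{}: BIC favours the infinitesimal model (diff={:.3}), '
              'polygenicity estimate may be unreliable'.format(fname, bic_diff))
    return bic_diff

Run over the eight .json files above, this would flag all of them, consistent with the negative BIC values in the table.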

gnarw commented 4 years ago

OK, I suspected this was the problem. Thank you for looking into it.
