Closed gnarw closed 4 years ago
To your first question: by default MiXeR uses --fit-sequence diffevo-fast neldermead in univariate analysis, and --fit-sequence diffevo-fast neldermead-fast brute1 brent1 in bivariate analysis. These are the recommended parameters, and you don't need to specify them.
The meaning (sorry for the perhaps too technical explanation...) is as follows:
- diffevo-fast applies one iteration of differential evolution optimization, using an approximate cost function. This cost function is described as the Gaussian approximation in the MiXeR Nat. Comm. paper.
- neldermead applies Nelder-Mead optimization, using the full cost function calculated via the convolution approach. This was developed and described in a recent AI-MiXeR paper by my colleague A. Shadrin. In MiXeR Nat. Comm. the univariate fit was based on a sampling approach; in terms of the final results we didn't see big differences between the two approaches, except that convolution is faster and applies to the full range of parameters, including very high polygenicity (pi>0.03), while sampling, the way it's currently implemented, only works for small pi (pi<0.03).
- brute1 performs a 1D search on a regular grid for the pi12 parameter, followed by the brent1 optimization method.
To validate this I've re-run the same simulations, and the results looked similar to what's in the MiXeR Nat. Comm. supplement.
For fitting the model, the recommendation is to constrain the fit to HapMap3 SNPs, i.e. use --extract, in line with the general practice set up by LD score regression. But I do realize this is quite controversial, and one to-do on my list is to validate the difference in parameter estimates depending on the reference (1kG, HRC, UKB), with and without constraining to HapMap3 SNPs.
But, for now, the only reason to skip --extract and use all SNPs is when you'd like to make QQ plots. I don't see a reason to constrain QQ plots to HapMap3 SNPs.
Same applies to bivariate analysis. The recommendation is to constrain bivariate model fit to HapMap3 SNPs, therefore it's reasonable to load univariate results fitted on HapMap3 SNPs.
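As an aside, the two-stage idea behind brute1 and brent1 (a coarse 1D grid search whose best point seeds a local 1D refinement) can be sketched generically. The cost function and helper names below are made up for illustration, and a golden-section search stands in for Brent's method; this is not MiXeR's actual implementation:

```python
import math

def cost(pi12):
    # hypothetical smooth 1D cost with a minimum at pi12 = 0.003
    return (math.log10(pi12) - math.log10(0.003)) ** 2

def brute_1d(fn, lo, hi, n=20):
    # stage 1: evaluate fn on a regular grid and return the best grid point
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return min(grid, key=fn)

def golden_section(fn, lo, hi, tol=1e-7):
    # stage 2: refine within a bracket around the best grid point
    # (simple stand-in for Brent's method)
    invphi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if fn(c) < fn(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

x0 = brute_1d(cost, 1e-5, 0.03)     # coarse guess from the grid
step = (0.03 - 1e-5) / 19           # one grid spacing on either side
x = golden_section(cost, max(1e-5, x0 - step), min(0.03, x0 + step))
print(x)  # close to 0.003
```

The grid stage makes the search robust to a poor starting point; the local stage then recovers the precision the grid lacks.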
Thanks for these answers. Very helpful.
I have been running MiXeR on the SCZ and EDU data sets (PGC_SCZ_2014_EUR_qc_noMHC.csv.gz and SSGAC_EDU_2018_no23andMe_noMHC.csv.gz) and the 1000G_EUR_Phase3.. reference files, as used in the Readme.md.
I then analysed another set of GWASs. Let's call them trait1 and trait2. The sample size for trait1 is 312,320, and for trait2 it is 449,618. I used the same reference panel and the same MiXeR commands as for the SCZ/EDU analysis. I have attached the resulting Venn diagrams from 4 runs (trait1_trait2.png). The results are very inconsistent. I'm not sure what the obvious cause could be here, or what to look for. Any suggestions?
Thanks,
@gnarw The 3rd run has a much larger polygenicity estimate for trait2. I've seen similar behavior for very low-heritability traits, with nearly flat QQ plots.
Could you please share
@gnarw Thanks for sharing the data. In this case the problem is related to low heritability (both traits have SNP h2 around 0.02). Calculating BIC shows that in this case BIC(infinitesimal model) is lower than BIC(causal mixture model), which is a problem as mentioned in the MiXeR paper Discussion section:
... MiXeR applies Bayesian information criterion (BIC) to compare causal mixture model versus the infinitesimal model, as shown in Supplementary Table 9. The cases where BIC selects the infinitesimal model indicate that the GWAS sample size is insufficient to reliably fit the polygenicity parameter.
In your case this didn't break convergence for trait1, presumably because it has slightly lower polygenicity (and is therefore easier to fit). The argument is a bit circular, but I see all four runs gave consistent estimates of polygenicity for trait1, and it seems to converge to 0.0021. For the second trait, the two runs with consistent answers gave 0.006 and 0.007, so it is not unlikely that the pi parameter is about 3 times higher than in the first trait.
import json
import numpy as np

def calc_BIC(params):  # lower is better
    return np.log(params['cost_n']) * params['cost_df'] + 2 * params['cost']

def calc_AIC(params):
    return np.log(params['cost_n']) * 2 + 2 * params['cost']

for trait in ['trait1', 'trait2']:
    for run in ['run1', 'run2', 'run3', 'run4']:
        data = json.loads(open('{}.fit - {}.json'.format(trait, run)).read())
        h2 = data['ci']['h2']['point_estimate']
        pi = data['ci']['pi']['point_estimate']
        # positive difference favors the causal mixture over the infinitesimal model
        aic = calc_AIC(data['inft_optimize'][-1][1]) - calc_AIC(data['optimize'][-1][1])
        bic = calc_BIC(data['inft_optimize'][-1][1]) - calc_BIC(data['optimize'][-1][1])
        print('{} {} h2={:.3f} pi={:.3f} AIC={:.3} BIC={:.3}'.format(trait, run, h2, pi, aic, bic))
Results (a negative BIC difference indicates that the causal mixture model doesn't yield a statistically significant improvement over the infinitesimal model):
trait1 run1 h2=0.020 pi=0.001 AIC=7.81 BIC=-3.88
trait1 run2 h2=0.021 pi=0.001 AIC=10.9 BIC=-0.829
trait1 run3 h2=0.022 pi=0.002 AIC=11.4 BIC=-0.266
trait1 run4 h2=0.021 pi=0.002 AIC=10.8 BIC=-0.876
trait2 run1 h2=0.025 pi=0.006 AIC=4.03 BIC=-7.47
trait2 run2 h2=0.026 pi=0.007 AIC=3.87 BIC=-7.62
trait2 run3 h2=0.025 pi=0.070 AIC=2.0 BIC=-9.5
trait2 run4 h2=0.023 pi=0.003 AIC=-5.68 BIC=-17.2
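The sign convention in the table above can be captured in a tiny helper (the function name is hypothetical, not part of MiXeR; it mirrors the differences computed by the script above):

```python
def bic_selects(bic_diff):
    # bic_diff = BIC(infinitesimal) - BIC(causal mixture), as in the table:
    # a positive difference favors the causal mixture model, a negative one
    # the infinitesimal model (i.e., not enough data to fit polygenicity)
    return 'causal mixture' if bic_diff > 0 else 'infinitesimal'

print(bic_selects(-7.47))  # trait2 run1 -> infinitesimal
```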
Hope this answers your question. I'll keep the ticket open to integrate AIC/BIC into the precimed/mixer_figures.py scripts - it should be clear from MiXeR runs when low heritability and (presumably) high polygenicity of a trait result in unreliable polygenicity estimates in univariate analysis, and therefore in an unstable bivariate analysis as well.
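The underlying point, that BIC falls back to the simpler model when the data cannot support an extra parameter, can be reproduced in a self-contained toy example (unrelated to MiXeR's actual cost functions): comparing a fixed-mean vs a fitted-mean Gaussian model, BIC picks the simpler model when the signal is weak and the richer model when it is strong.

```python
import numpy as np

def bic(neg_loglik, df, n):
    # standard BIC with cost = negative log-likelihood; lower is better
    return np.log(n) * df + 2 * neg_loglik

def neg_loglik_gauss(x, mu):
    # negative log-likelihood of x under N(mu, 1)
    return 0.5 * np.sum((x - mu) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
g = rng.normal(0.0, 1.0, 100)
noise = np.concatenate([g, -g])     # symmetrized so the sample mean is ~exactly the true mean
n = len(noise)

for true_mu in (0.05, 1.0):         # weak vs strong signal
    x = true_mu + noise
    bic_simple = bic(neg_loglik_gauss(x, 0.0), df=0, n=n)     # fixed mu = 0
    bic_rich = bic(neg_loglik_gauss(x, x.mean()), df=1, n=n)  # fitted mu
    print(true_mu, 'simple' if bic_simple < bic_rich else 'rich')
# prints: 0.05 simple / 1.0 rich
```

With the weak signal, the likelihood gain from fitting the mean is smaller than the log(n) penalty on the extra parameter, so BIC selects the fixed-mean model, analogous to the infinitesimal model winning for low-heritability traits.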
Ok. I suspected this to be the problem. Thank you for looking at this problem.
Looking at the MiXeR fit-sequence options, I have been thinking about which fitting options to choose. Here is a list of 3 items; it would be helpful if you could help me understand these options.