uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools
71 stars 14 forks source link

Comparing graphs with compare_fits #67

Open ndussex opened 3 months ago

ndussex commented 3 months ago

Hi all,

I am still a bit confused about model selection and it would be great to have some input. I first ran the find_graphs command to generate some graph for a dataset with 5 populations and varying admixture events from 0 to 7 (see figure attached). The scores drop a lot between admix=0 and admix=2.

I then used the qpgraph_resample_multi and compare_fits functions to compare each model such as: admix0 vs admix1, admix1 vs admix2, etc... and I obtain this:

admix_0 vs admix_1

$diff

[1] 6.497123

$se

[1,] 4.868707

$z

[1,] 1.334466

$p

[1] 0.1820512

$p_emp

[1] 0.01

$p_emp_nocorr

[1] 0

$ci_low

[1] 0.5623495

$ci_high

[1] 18.21102

admix_1 vs admix_2

[1] 0.8507611

$diff

$se

[1,] 1.075474

$z

[1,] 0.7910565

$p

[1] 0.428911

$p_emp

[1] 0.06

$p_emp_nocorr

[1] 0.06

$ci_low

[1] -0.0001026993

$ci_high

[1] 3.659816

admix_2 vs admix_3

$diff

[1] 0.2281275

$se

[1,] 0.790918

$z

[1,] 0.2884339

$p

[1] 0.7730147

$p_emp

[1] 0.76

$p_emp_nocorr

[1] 0.76

$ci_low

[1] -0.6703945

$ci_high

[1] 2.532137

admix_3 vs admix_4

$diff

[1] 0.09823237

$se

[1,] 0.3638116

$z

[1,] 0.2700089

$p

[1] 0.7871534

$p_emp

[1] 0.24

$p_emp_nocorr

[1] 0.24

$ci_low

[1] -0.5779395

$ci_high

[1] 1.15629

admix_4 vs admix_5

$diff

[1] 0.112138

$se

[1,] 0.3549512

$z

[1,] 0.315925

$p

[1] 0.7520595

$p_emp

[1] 0.01

$p_emp_nocorr

[1] 0

$ci_low

[1] 6.032491e-05

$ci_high

[1] 0.7943132

admix_5 vs admix_6

$diff

[1] 1.288916e-05

$se

[1,] 2.059777e-05

$z

[1,] 0.6257553

$p

[1] 0.5314754

$p_emp

[1] 0.38

$p_emp_nocorr

[1] 0.38

$ci_low

[1] -3.753325e-05

$ci_high

[1] 2.950797e-05

admix_6 vs admix_7

$diff

[1] 1.697339e-05

$se

[1,] 4.100945e-05

$z

[1,] 0.4138897

$p

[1] 0.6789549

$p_emp

[1] 0.72

$p_emp_nocorr

[1] 0.72

$ci_low

[1] -3.357979e-06

$ci_high

[1] 0.0001371293

Now, I am not entirely sure how to select the model that explains best my data. It looks like the difference between models is greatest between admix_0 and admix_1 and to some extent admix_1 and admix_2 and the p_emp value is significant to close-to-significant, respectively. For the other comparisons, the difference is much lower and p_emp not significant. So, I would be tempte to say that admix_2 is the best model based on its score (i.e. 0.000836) and near significance.

Am I getting this right?

Thanks!

Figure S2

uqrmaie1 commented 2 months ago

You are mostly getting this right, but compare_fits() should be used to compare the fits of graphs with the same number of admixture events, not graphs with a different number of admixture events. More admixture events can result in lower scores even when the topology is further from the truth.

If you want to compare graphs with a different number of admixture events, you could use thescore_test column in the output of qpgraph_resample_multi() (instead of the score column), which is calculated only based on SNP blocks which were not used in optimizing the fit in each bootstrap replicate. However, I haven't tested this a lot, and it seemed to be too conservative (in the sense that the topology differences would have to be very large before the difference in scores was significant).

Another thing I would recommend is to not only look at the single best graph for each N, but to look at several graphs with low scores. If there are multiple different graphs with the same N and similar scores (not significantly different according to compare_fits()), that means that there isn't enough data for the complexity of the model, and you may have to fit simpler models (lower N) to avoid overfitting.

ndussex commented 2 months ago

Thanks for your reply and suggestion.

So, I tested what you suggested, i.e. using the score_test column instead, which gave me the following (graph0 = 0 admixture, graph1 = 1 admixture and so on).

The greatest difference is between graph4 and graph5, with the graph4 having the lowest score and thus explaining best my data. However, it is not significant. Does this mean that fewer admixture events actually explain better my data?

Thanks a lot!

graph0, graph1

fits = qpgraph_resample_multi(f2_blocks, list(graph0, graph1), nboot = 100) compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

$diff

[1] -0.7311143

$se

[,1]

[1,] 8.199706

$z

[,1]

[1,] -0.08916348

$p

[1] 0.928952

$p_emp

[1] 0.78

$p_emp_nocorr

[1] 0.78

$ci_low

[1] -23.85085

$ci_high

[1] 9.367359

graph1, graph2

fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100) compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

$diff

[1] -0.3664455

$se

[1,] 2.743555

$z

[1,] -0.1335659

$p

[1] 0.8937458

$p_emp

[1] 0.82

$p_emp_nocorr

[1] 0.82

$ci_low

[1] -7.62789

$ci_high

[1] 3.723855

graph2, graph3

fits = qpgraph_resample_multi(f2_blocks, list(graph2, graph3), nboot = 100) compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

$diff

[1] -0.3756326

$se

[1,] 2.215116

$z

[1,] -0.1695769

$p

[1] 0.8653429

$p_emp

[1] 0.86

$p_emp_nocorr

[1] 0.86

$ci_low

[1] -3.324304

$ci_high

[1] 0.6999956

graph3, graph4

fits = qpgraph_resample_multi(f2_blocks, list(graph3, graph4), nboot = 100) compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

$diff

[1] -1.60316

$se

[1,] 2.911576

$z

[1,] -0.5506158

$p

[1] 0.5818971

$p_emp

[1] 0.42

$p_emp_nocorr

[1] 0.42

$ci_low

[1] -9.219281

$ci_high

[1] 0.05858196

graph4, graph5

fits = qpgraph_resample_multi(f2_blocks, list(graph4, graph5), nboot = 100) compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

$diff

[1] -1.729622

$se

[1,] 3.674269

$z

[1,] -0.4707392

$p

[1] 0.637827

$p_emp

[1] 0.62

$p_emp_nocorr

[1] 0.62

$ci_low

[1] -12.75352

$ci_high

[1] 0.07438104

graph4, graph5

fits = qpgraph_resample_multi(f2_blocks, list(graph5, graph6), nboot = 100) compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

$diff

[1] -0.00224161

$se

[1,] 0.02516034

$z

[1,] -0.08909302

$p

[1] 0.929008

$p_emp

[1] 0.94

$p_emp_nocorr

[1] 0.94

$ci_low

[1] -0.05009878

$ci_high

[1] 0.03952165