uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools

Question: Differences in qpAdm/qpWave model fitting between Admixtools 1 and 2 #29

Open dongkeeeee opened 1 year ago

dongkeeeee commented 1 year ago

Hello. My name is Donghee Kim. I'm a PhD student in the Jeong lab at Seoul National University, supervised by Choongwon Jeong. I have a question about differences in qpAdm/qpWave model fitting between Admixtools 1 and 2.

I wanted to compare against previous f4 and qpAdm results produced by Admixtools 1. First, I adjusted the f-statistics calculation options (poly_only=F, jackknife block size=5e6) so that the results would match between Admixtools 1 and 2. With that, I got the same f4 results as before. However, the chi-squared values from qpAdm/qpWave model fitting still differ. I read the manuals and papers for Admixtools 1 and 2 but couldn't find any difference in the qpAdm/qpWave modelling mechanism. My guess is that the R routines for estimating the error matrix E might make a difference.

To summarize, I wonder what causes the subtle difference in qpAdm/qpWave model fit between Admixtools 1 and 2, and whether there are any options that would make the model fits identical, for comparison with previous results.

Thank you.

Sincerely, Donghee Kim.

uqrmaie1 commented 1 year ago

Can you share the parameter file you used for Admixtools 1, the exact commands you used in Admixtools 2, and the outputs you get?

If the results are substantially different, the most likely reason is that the settings don't match. If the settings match, there can still be a small difference; I never tracked down the reason for these very small differences. They seemed too small to be of much relevance. It's possible that it has to do with how the error matrix is estimated. If the boundaries of the jackknife blocks don't match perfectly between Admixtools 1 and 2, that can be enough to make the estimated variances slightly different.
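To illustrate why block boundaries matter at all, here is a toy Python sketch of a delete-one-block jackknife for a simple mean (not admixtools code, and the data are made up): shifting which SNPs fall into which block changes the leave-one-block-out estimates, and therefore the estimated standard error, by a small amount.

```python
import random

def jackknife_se(values, block_size):
    """Standard error of the mean of `values` from a delete-one-block
    jackknife with contiguous blocks of `block_size` elements."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    g = len(blocks)
    n = len(values)
    total = sum(values)
    # leave-one-block-out estimates of the mean
    loo = [(total - sum(b)) / (n - len(b)) for b in blocks]
    mean_loo = sum(loo) / g
    var = (g - 1) / g * sum((x - mean_loo) ** 2 for x in loo)
    return var ** 0.5

random.seed(1)
snp_stats = [random.gauss(0, 1) for _ in range(1000)]

se_a = jackknife_se(snp_stats, 50)
# shift every block boundary by one SNP: same data, slightly different SE
se_b = jackknife_se(snp_stats[1:] + snp_stats[:1], 50)
print(se_a, se_b)
```

With identical data, merely shifting the block boundaries changes the estimated standard error slightly, and that difference propagates into the chi-squared and p-values.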

dongkeeeee commented 1 year ago

Thank you for the quick answer. Here are the commands and results. I don't understand why the chi-squared values differ substantially while the weights do not.

Admixtools 1 command:

```r
library(admixtools)

fn1 <- "test"

left = c('Muturu', 'Sahiwal')
right = c('EuropeanBison', 'AmericanBison', 'Simmental', 'Hanwoo')
target = 'Afar'

a <- qpadm_wrapper(fn1, left, right, target,
                   '/opt/ohpc/pub/apps/AdmixTools/bin/qpAdm',
                   parfile = "./parfile", outdir = "./")
```

Results:

```
weights:
  target left    weight mean  se    z
1 Afar   Muturu  0.273  0.273 0.005  54.5
2 Afar   Sahiwal 0.727  0.727 0.005 145.

popdrop:
  pat wt dof chisq p     Muturu Sahiwal feasible
1 00  0   2  2.65  0.266 0.273  0.727   TRUE
2 01  1   3  8971. 0     1      0       TRUE
3 10  1   3  2751. 0     0      1       TRUE
```

Admixtools 2 commands:

```r
library(admixtools)

fn1 <- "test"

left = c('Muturu', 'Sahiwal')
right = c('EuropeanBison', 'AmericanBison', 'Simmental', 'Hanwoo')
target = 'Afar'

mypops <- c(left, right, target)

dt <- f2_from_geno(fn1, pops = mypops, auto_only = F, maxmem = 10000,
                   poly_only = F, blgsize = 5000000)

b <- qpadm(dt, left, right, target)
```

Results:

```
weights:
  target left    weight se       z
1 Afar   Muturu  0.273  0.00488  55.8
2 Afar   Sahiwal 0.727  0.00488 149.

popdrop:
  pat wt dof chisq  p     f4rank Muturu Sahiwal feasible best dofdiff
1 00  0   2  4.19   0.123 1      0.273  0.727   TRUE     NA   NA
2 01  1   3  10135. 0     0      1      NA      TRUE     TRUE 0
3 10  1   3  3098.  0     0      NA     1       TRUE     TRUE NA
```

uqrmaie1 commented 1 year ago

Thank you for posting the commands!

I had another look at this, and I found a few reasons why qpadm chi-squared and p-values can differ between Admixtools 1 and Admixtools 2.

  1. SNP block boundaries

In Admixtools 1, missing SNPs are not excluded when SNP block boundaries are calculated. In Admixtools 2, they are excluded first when running f2_from_geno() or extract_f2() (but not when reading the data from genotype files directly). This can result in slightly different SNP blocks, which can affect the p-values a bit. I could change this so that SNP blocks are always computed from all SNPs, to make things more consistent, but the change is not straightforward, and the risk of introducing new bugs may outweigh the benefit.
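The effect of dropping missing SNPs before computing block boundaries can be sketched like this (a toy Python illustration with made-up positions; anchoring blocks at the first retained SNP is an assumption of this sketch, not the actual admixtools algorithm):

```python
def assign_blocks(positions, blgsize):
    """Assign each SNP (bp position) to a contiguous block of `blgsize`
    base pairs, with blocks anchored at the first SNP in the list."""
    start = positions[0]
    return [(p - start) // blgsize for p in positions]

# hypothetical SNP positions (bp) on one chromosome; the first SNP
# is missing in one of the populations being analyzed
positions = [1_000_000, 4_900_000, 5_100_000, 9_800_000, 10_200_000]
missing = {1_000_000}

# boundaries computed from all SNPs
blocks_all = assign_blocks(positions, 5_000_000)

# boundaries computed after dropping the missing SNP first
present = [p for p in positions if p not in missing]
blocks_present = assign_blocks(present, 5_000_000)

print(blocks_all)      # [0, 0, 0, 1, 1]
print(blocks_present)  # [0, 0, 0, 1]
```

Here the SNP at 9.8 Mb falls into block 1 when boundaries are computed from all SNPs, but into block 0 once the missing SNP is dropped first, so the jackknife ends up resampling slightly different sets of SNPs.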

  2. SNP block lengths

Admixtools 1 calculates SNP block boundaries based on all SNPs, but the calculated block lengths do not include missing SNPs. Admixtools 2, when reading the data from genotype files directly, calculated the block lengths based on all SNPs, which is wrong. I changed that in the latest version. So in the latest version, if you run your example again, not with f2_from_geno() but reading the data from genotype files directly, you should see a smaller difference in p-values compared to Admixtools 1.
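A minimal sketch of the block-length difference (Python, illustrative only; the block assignments and missingness pattern are made up):

```python
from collections import Counter

# hypothetical data: block index of each SNP (boundaries computed from
# all SNPs) and whether the SNP is missing in the populations used
block_of_snp = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
missing      = [False, True, False, True, False,
                False, False, True, False, False]

# block lengths counting every SNP, including missing ones
# (the old Admixtools 2 behavior when reading genotype files directly)
len_all = Counter(block_of_snp)

# block lengths counting only non-missing SNPs
# (Admixtools 1, and Admixtools 2 after the fix)
len_used = Counter(b for b, m in zip(block_of_snp, missing) if not m)

print(dict(len_all))   # {0: 4, 1: 3, 2: 3}
print(dict(len_used))  # {0: 2, 1: 3, 2: 2}
```

Which SNPs get counted shifts the weight each block receives in the weighted block jackknife, and with it the estimated variances.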

  3. Covariance matrix regularization

Before the covariance matrix is inverted, a regularization term is added to the diagonal elements, proportional to the fudge parameter times the trace of the matrix. In Admixtools 1, this is done twice. I have now added a fudge_twice parameter to the qpadm function so that this behavior can be reproduced.
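A minimal numeric sketch of that regularization step (Python, not the actual admixtools code; recomputing the trace before the second addition is an assumption of this sketch):

```python
def regularize(cov, fudge, times=1):
    """Return a copy of `cov` with fudge * trace(cov) added to each
    diagonal element, applied `times` times (trace recomputed between
    passes)."""
    n = len(cov)
    out = [row[:] for row in cov]
    for _ in range(times):
        tr = sum(out[i][i] for i in range(n))
        for i in range(n):
            out[i][i] += fudge * tr
    return out

cov = [[2.0, 0.5],
       [0.5, 1.0]]

once  = regularize(cov, fudge=1e-4, times=1)   # single application
twice = regularize(cov, fudge=1e-4, times=2)   # fudge_twice-style behavior

print(once[0][0], twice[0][0])
```

Applying the term twice perturbs the diagonal slightly more before inversion, which is enough to nudge the chi-squared values.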

  4. Very large chi-squared values

After fixing the bug in the block-length calculation, and setting fudge_twice = TRUE, I now get very close to identical chi-squared and p-values in my tests of qpadm in Admixtools 1 and 2, except sometimes when the chi-squared values are very large. I might get 100 vs 1000 in Admixtools 1 vs 2, for example. I haven't looked into this further, because I don't think differences in that range matter for qpadm.

Please let me know if you still see differences after setting fudge_twice = TRUE and reading the data from genotype files directly. If yes, the best way for me to look into this would be to replicate this on your data, if you can share it!