iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0

Does it make sense to convert SNP data to binary data? #111

Open Yyeserin opened 1 year ago

Yyeserin commented 1 year ago

Hi,

I converted my SNP data to binary. The characters are 0, 1 and 2. I use the command below to run IQ-TREE:

iqtree2 -s populations.snps.92ind.binary.NEX -st BIN -m TEST+ASC -bb 2000 -wbt -alrt 2000 -abayes -nm 2000 -nt 8 -bnni

IQ-TREE treats this data as morphological data. I use this encoding because runs with an ASC model always fail, as I cannot get rid of the invariant sites. Omitting the ASC model doesn't make sense for SNP data either.

Actually, the trees look pretty good when I use the binary format. However, the model search tests morphological models, which makes me feel uncomfortable. Do you think it is a bad idea to code the data as binary?

I appreciate your suggestions a lot.

Best, Yeserin.

ModelFinder will test up to 2 morphological models (sample size: 9868) ...
 No. Model          -LnL        df   AIC          AICc         BIC
  1  MK+FQ+ASC      604290.128  181  1208942.256  1208949.058  1210244.923
  2  MK+FQ+ASC+G4   580658.860  182  1161681.721  1161688.599  1162991.584
Akaike Information Criterion:           MK+FQ+ASC+G4
Corrected Akaike Information Criterion: MK+FQ+ASC+G4
Bayesian Information Criterion:         MK+FQ+ASC+G4
Best-fit model: MK+FQ+ASC+G4 chosen according to BIC
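For anyone reading the log above, the three criteria can be recomputed from the -LnL, df and sample-size columns with the standard textbook formulas. The following is only a sketch: the function name `information_criteria` is made up for illustration and is not part of IQ-TREE, and the inputs are simply the numbers copied from the log.

```python
import math


def information_criteria(neg_lnl, df, n):
    """Return (AIC, AICc, BIC) from a negative log-likelihood.

    neg_lnl: the "-LnL" column of the log (a positive number)
    df:      number of free parameters
    n:       sample size (number of sites reported by ModelFinder)
    """
    aic = 2.0 * neg_lnl + 2.0 * df
    aicc = aic + 2.0 * df * (df + 1) / (n - df - 1)
    bic = 2.0 * neg_lnl + df * math.log(n)
    return aic, aicc, bic


# Values copied from the ModelFinder log in this thread.
SAMPLE_SIZE = 9868
models = {
    "MK+FQ+ASC": (604290.128, 181),
    "MK+FQ+ASC+G4": (580658.860, 182),
}

scores = {
    name: information_criteria(neg_lnl, df, SAMPLE_SIZE)
    for name, (neg_lnl, df) in models.items()
}

# Lower is better for all three criteria; index 2 is BIC.
best_by_bic = min(scores, key=lambda name: scores[name][2])

for name, (aic, aicc, bic) in scores.items():
    print(f"{name}: AIC={aic:.3f} AICc={aicc:.3f} BIC={bic:.3f}")
print("Best by BIC:", best_by_bic)
```

The recomputed values agree with the log to within rounding of the printed likelihoods, and the +G4 model wins under every criterion, as reported.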

bqminh commented 1 month ago

I don't see any problem with this. MK is a Jukes-Cantor-type model with equal substitution rates between states. Btw, what does 2 encode? Also, I thought that when +ASC is used with invariant sites in the alignment, IQ-TREE actually writes out a file without invariant sites, which you can run later.

Yyeserin commented 1 month ago

Hi again,

Thank you for your reply.

0, 1 and 2 encode AA, AB, and BB, if that was your question.

There was a bug in the IQ-TREE version I was using at that time, so the file without invariant sites didn't work. When I switched to a newer version, the file worked. So, I could use invariant sites in the end.

Best, Yeserin.
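As an illustration of the encoding described in this reply (AA, AB, BB coded as 0, 1, 2) and of what "a file without invariant sites" means, here is a minimal sketch. Everything in it is an assumption made for the example: the function names, the `?` missing-data symbol, and the rule that a site is constant when it shows only one non-missing state. IQ-TREE's own check for +ASC may differ in detail, so the file it writes remains the authoritative one.

```python
# Hypothetical mapping, following the AA/AB/BB -> 0/1/2 coding in this thread.
GENOTYPE_TO_STATE = {"AA": "0", "AB": "1", "BB": "2"}
MISSING = "?"  # assumed symbol for anything that is not AA, AB or BB


def encode_genotypes(genotypes):
    """Turn a list of genotype calls into a string of 0/1/2 states."""
    return "".join(GENOTYPE_TO_STATE.get(g, MISSING) for g in genotypes)


def drop_constant_sites(alignment):
    """Remove columns that show at most one non-missing state.

    alignment: dict mapping sample name -> equal-length state string.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if len(set(col) - {MISSING}) > 1]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(names)
    }


# Toy data, invented for the example.
raw = {
    "ind1": ["AA", "AB", "BB", "AA"],
    "ind2": ["AA", "BB", "BB", "AB"],
    "ind3": ["AA", "AB", "BB", "??"],
}
encoded = {name: encode_genotypes(calls) for name, calls in raw.items()}
variable_only = drop_constant_sites(encoded)

for name, states in variable_only.items():
    print(name, states)
```

In the toy data, the first and third sites are constant across all three samples and are dropped, leaving two variable sites per sample.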

bqminh commented 1 month ago

I see, this is actually good to know. Then I think the MK model's assumption (equal substitution rates) might not be reasonable. For example, AA mutating to AB is easier than AA mutating to BB. I'd actually add the GTR model to the set of tested models by using -mset MK,GTR to see whether GTR provides a better fit to the data.

The default for morphological data is -mset MK, to prevent users from using GTR with a very large number of states. But if you have just 3 states (even fewer than DNA data), then it's OK to use the GTR model.

You can do a quick run to see whether it changes anything.
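To put a number on why GTR is a concern with many states but not with 3, the sketch below counts free exchangeability parameters. It assumes the usual convention for a reversible model with k states: k(k-1)/2 exchangeabilities, one of which is fixed to set the scale. The function names are invented for illustration, and the cost of any estimated state frequencies would come on top of these counts.

```python
def gtr_free_rates(num_states):
    """Free exchangeability parameters of a reversible GTR-type model.

    There are k*(k-1)/2 exchangeabilities; one is fixed to set the scale.
    """
    return num_states * (num_states - 1) // 2 - 1


def mk_free_rates(num_states):
    """MK assumes equal rates between all states, so nothing is estimated."""
    return 0


# 3 states as in this thread, 4 for DNA, and two larger alphabets for contrast.
for k in (3, 4, 20, 32):
    print(f"{k:>2} states: MK={mk_free_rates(k)}  GTR={gtr_free_rates(k)}")
```

With 3 states, GTR adds only 2 rate parameters over MK, compared with 5 for DNA, so testing it alongside MK is cheap.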