Open roblanf opened 2 years ago
We can build example Nextflow pipelines to run these analyses and make nice comparison plots in R.
With many data sets in HIV research, we know at least some aspects of the true tree. For example, we know of some transmission chains where each "donor" virus has to be ancestral to each "recipient". In my experience, when there are any significant differences between trees, it is nearly always due to the input data and not the methods and models used. One major issue is the choice of an outgroup (or using no outgroup and choosing midpoint rooting), and the fact that the point where the outgroup joins the ingroup is nearly always wrong. Another issue is recombination, such that the true history is not a bifurcating tree (and SplitsTree and other methods of representing networks are not always an ideal solution). Another issue is extreme distances between sequences: either too few mutations, as in SARS-CoV-2 data, or saturation of silent sites, as in comparing all primate lentiviruses. Another issue is the size of the data set: in many cases it is now very easy to find and align sequences from thousands of isolates (SARS-CoV-2 genomes now number in the tens of millions), so building a true ML tree is impossible.
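As a rough way to spot the "extreme distances" problem before building any tree, one could compute pairwise p-distances (the proportion of differing sites) over an alignment. The sketch below is only an illustration: the function names are made up, it assumes an already aligned set of DNA sequences held in a plain dictionary, and it skips any site where either sequence has something other than A, C, G or T. It does not correct for multiple hits, so saturated data will be underestimated.

```python
from itertools import combinations

# Assumption: only unambiguous bases are compared; gaps, N and ? are skipped.
VALID = set("ACGT")

def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites where both sequences have a base."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID and b in VALID:
            compared += 1
            differences += a != b
    # None signals that the pair shares no comparable sites at all.
    return differences / compared if compared else None

def pairwise_p_distances(alignment):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }

if __name__ == "__main__":
    toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACNTTCGA"}
    for pair, d in pairwise_p_distances(toy).items():
        print(pair, d)
```

Pairs with distances near zero carry almost no signal, and pairs with very high distances at silent sites are likely saturated; where to draw those lines depends on the data set.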
For one example of the outgroup issue, there are many papers asking "which lineage of placental mammals is most ancestral?". And there is quite a distance between the monotremes, marsupials and placental mammals. With placental mammals, there is no reason to expect a perfect "star phylogeny". It is possible that something like an elephant shrew family evolved for millions of years before primates, rodents, bats, etc. each evolved out of it at different times. But with the HIV-1 M group we are almost certain there was a single point introduction from one chimpanzee into humans (there were other introductions, for the N, O and P groups, and for HIV-2, etc.), and yet, using SIV-Chimpanzee as the outgroup, the joining point is essentially never the center of the M star phylogeny. Instead it is significantly far up one of the subtype branches, such that it looks like this clade was "ancestral" and evolved for some time before the other clades split off. Measuring distances from this "root point" to the tips then suggests that this clade has shorter branch lengths (appears to evolve much more slowly). Distances from the center of the M group to the tips are close to equal, with all clades evolving at close to the same rate.
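The effect of a misplaced root on apparent rates can be shown with a toy example. The sketch below is an illustration only, with invented branch lengths and clade names: it represents an unrooted tree as a list of `(node, node, length)` edges, places a root either at the center of a star or partway up one branch, and reports root-to-tip distances. The helper names are hypothetical and not taken from any phylogenetics package.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn (u, v, length) edges into a symmetric adjacency map."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj

def root_to_tip(edges, root):
    """Distance from `root` to every tip (node with exactly one neighbour)."""
    adj = build_adjacency(edges)
    dist = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, length in adj[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return {n: d for n, d in dist.items() if len(adj[n]) == 1 and n != root}

def place_root_on_edge(edges, u, v, offset_from_u, root_name="root"):
    """Split the edge u-v with a new root node `offset_from_u` away from u."""
    new_edges = []
    for a, b, length in edges:
        if {a, b} == {u, v}:
            d = offset_from_u if a == u else length - offset_from_u
            new_edges.append((a, root_name, d))
            new_edges.append((root_name, b, length - d))
        else:
            new_edges.append((a, b, length))
    return new_edges

if __name__ == "__main__":
    # Toy star phylogeny: four clades, equal branch lengths (invented numbers).
    star = [("center", c, 0.10) for c in ("A", "B", "C", "D")]
    print("rooted at center:", root_to_tip(star, "center"))
    # Outgroup joins 0.04 up the branch to clade A instead of at the center.
    shifted = place_root_on_edge(star, "center", "A", 0.04)
    print("rooted on branch A:", root_to_tip(shifted, "root"))
```

With the root at the center, every clade is 0.10 from the root. With the root moved 0.04 up the branch to A, clade A is about 0.06 from the root and the others about 0.14, so A looks slower even though nothing about the branch lengths changed.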
For most alignment issues I have seen, the alignment is fairly simple if the data is in a reasonable range of distances for phylogenetic inference. It is easy, for example, to align all HIV-1 M group genomes except for a few regions which are very rich in insertions/deletions. Aligning all primate lentivirus genomes is more difficult. The problem I observe most often in large data sets used for inference of mammal, fish, bird, reptile or insect evolution, with the data made available in a repository such as TreeBASE, is that they have a huge matrix of hundreds of genes from dozens of species, but with large percentages of missing data filled in with NNNNN or ?????? characters. In one case I studied, the authors inferred that some lineages of birds evolved much faster than others, but the data set showed that the fast-evolving lineages had a higher percentage of mitochondrial genes in the matrix, and mitochondrial DNA evolves nearly 10 times faster than nuclear DNA.
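A simple check for this kind of problem is to tabulate, for each taxon, how much of each gene partition is actually present in the matrix. The sketch below is a minimal illustration under stated assumptions: the concatenated alignment is a dictionary of taxon name to sequence, the partitions are given as hypothetical `(start, end)` half-open ranges, and `N`, `?` and `-` are all counted as missing (gaps may be real indels in some data sets, so that choice would need adjusting).

```python
# Assumption: these characters all mean "no data" in the concatenated matrix.
MISSING = set("N?-")

def missing_fraction(seq, start=0, end=None):
    """Fraction of missing characters in seq[start:end]."""
    region = seq[start:end].upper()
    if not region:
        return 0.0
    return sum(ch in MISSING for ch in region) / len(region)

def missing_by_partition(alignment, partitions):
    """Return {taxon: {partition_name: fraction_missing}}."""
    return {
        taxon: {
            name: missing_fraction(seq, start, end)
            for name, (start, end) in partitions.items()
        }
        for taxon, seq in alignment.items()
    }

if __name__ == "__main__":
    # Toy matrix: first 8 columns "nuclear", last 8 "mitochondrial" (invented).
    toy = {
        "bird_1": "NNNNNNNN" + "ACGTACGT",
        "bird_2": "ACGTACGT" + "ACGT????",
    }
    parts = {"nuclear": (0, 8), "mito": (8, 16)}
    for taxon, row in missing_by_partition(toy, parts).items():
        print(taxon, row)
```

If the taxa inferred to be fast-evolving are also the ones whose data come mostly from the fast partitions, as in the bird example above, the rate difference may reflect the composition of the matrix and not the lineages.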
Reviewers of papers almost always want bootstrap support values, and many details about the models and methods of phylogenetic reconstruction, but seem to never scrutinize the quality of the data.
Some ideas from a chat with Fred Jaya. Use the wiki to give a few examples of simple benchmarks, e.g.