iTaxoTools / TaxI2

Calculation and analysis of pairwise sequence distances
GNU General Public License v3.0

Implement support of phylogenetic trees to calculate genetic distances between direct sister species #4

Open mvences opened 3 years ago

mvences commented 3 years ago

To make the TaxI3 output more informative, one important additional feature would be support for phylogenetic trees, so that direct sister species can be identified among the species in the data set. This will be an extension of the "all against all" comparison.

It will include three levels of complexity.

The first level will be support for phylogenetic trees provided by the user as additional input. Here we need to make sure that the taxon names in the tree can be linked directly to the sequence files. We should first implement the feature with tab-delimited and Excel input; the requirement will be that each taxon name in the tree is identical to the corresponding seqid field. Below I will provide some suggestions for implementing this first level. Once it is done, we can think about the second and third levels.
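As a rough illustration of the name-matching requirement, the sketch below compares the leaf names of a Newick string with a list of seqid values. It assumes the tree arrives in Newick format (the issue does not fix a format), uses a deliberately simple regular expression that does not handle quoted labels or labels containing whitespace, and the function names are hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Return the set of leaf labels in a Newick string.

    Simplification: a leaf label is any token that directly follows "(" or ",".
    Quoted labels and labels containing whitespace are not handled.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def compare_tree_and_table(newick, seqids):
    """Return (names found only in the tree, seqids found only in the table)."""
    in_tree = newick_leaf_names(newick)
    in_table = set(seqids)
    return in_tree - in_table, in_table - in_tree


# Made-up example: B2 has no seqid in the table, C1 is missing from the tree.
tree = "((A1:0.1,A2:0.1):0.2,(B1:0.1,B2:0.1):0.2);"
only_tree, only_table = compare_tree_and_table(tree, ["A1", "A2", "B1", "C1"])
print(only_tree)   # tree names without a matching seqid
print(only_table)  # seqids that do not occur in the tree
```

Both sets being empty would mean that every taxon name in the tree is identical to exactly one seqid, which is the condition stated above.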

The second level will be to make TaxI3 calculate trees on its own, directly from the sequences. There are many programs for this; we can start by implementing very simple algorithms, but eventually we want to have some more complex ones. Some preliminary suggestions on this are given below.
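To give an idea of what a "very simple algorithm" could look like, here is a sketch of UPGMA (average-linkage clustering) on uncorrected p-distances. UPGMA is only one possible candidate and is not named in this issue; the sketch assumes the sequences are already aligned to equal length, ignores branch lengths, and returns the tree as nested tuples, which is a hypothetical representation chosen for brevity.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster sequences by average linkage; return a rooted tree as nested tuples."""
    if not seqs:
        raise ValueError("at least one sequence is required")
    # Each cluster is keyed by its subtree; the value lists the leaf names in it.
    clusters = {name: [name] for name in seqs}
    leaf_dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }

    def average(c1, c2):
        # Mean distance over all leaf pairs drawn from the two clusters.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(leaf_dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merged = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = merged
    return next(iter(clusters))


# Made-up aligned sequences for two species with two sequences each.
seqs = {"A1": "ACGTACGT", "A2": "ACGTACGA", "B1": "TTGTACCC", "B2": "TTGTACCG"}
print(upgma(seqs))  # (('A1', 'A2'), ('B1', 'B2'))
```

UPGMA assumes a constant rate of evolution, so it would mainly serve as a starting point before more complex methods are added.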

The third level will be to also consider some kind of support value in the calculation. Whether two taxa are sister to each other in a phylogenetic tree can be weakly or strongly supported by the data, and there are means to assess this, for instance through pseudoreplication (bootstrap), which is implemented in many tree-searching programs. We will leave this until the very end.
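For orientation only, the core of a bootstrap is resampling alignment columns with replacement and counting how often a result recurs. The sketch below assumes aligned, equal-length sequences; `is_supported` is a hypothetical callback that would build a tree from one replicate and report whether the sister relationship of interest is recovered.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement; return a pseudoreplicate."""
    length = len(next(iter(seqs.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in seqs.items()}


def support_value(seqs, is_supported, n_replicates=100, seed=0):
    """Fraction of pseudoreplicates for which is_supported(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        bool(is_supported(bootstrap_alignment(seqs, rng)))
        for _ in range(n_replicates)
    )
    return hits / n_replicates


# Made-up aligned sequences; the replicate keeps the names and the length.
seqs = {"A1": "ACGTACGT", "A2": "ACGTACGA", "B1": "TTGTACCC", "B2": "TTGTACCG"}
replicate = bootstrap_alignment(seqs, random.Random(1))
print(replicate)
```

A support value near 1 would indicate that the relationship is recovered in almost all pseudoreplicates.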

To work with phylogenetic trees, we need a dedicated library for trees. There are several of them:

What is a sister species? See the graph below for a very simple explanation.

[Figure: sister species 1]

However, in a data set for TaxI3, we will typically have trees with more than one sequence per species. So, in the following example, A and B are sister species. An important concept for this is that all sequences assigned in the data set to one species are grouped together (they are monophyletic):

[Figure: sister species 2]
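The monophyly condition can be sketched as follows: a species is monophyletic if some clade contains all of its sequences and nothing else. The sketch assumes a rooted tree written as nested tuples with seqid strings as leaves (a hypothetical representation) and a `species_of` mapping from seqid to species name taken from the sequence table.

```python
def leaves(node):
    """Collect the leaf names below a node; leaves are strings, clades are tuples."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def clades(node):
    """Yield every node of the tree, including the leaves and the root."""
    yield node
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, species_of, species):
    """True if some clade holds all sequences of `species` and no others."""
    members = {seqid for seqid, sp in species_of.items() if sp == species}
    return any(set(leaves(clade)) == members for clade in clades(tree))


species_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}

# All sequences of A group together, and so do all sequences of B.
grouped = ((("A1", "A2"), ("B1", "B2")), "C1")
print(is_monophyletic(grouped, species_of, "A"))  # True

# A2 groups with B1, so neither A nor B forms its own clade.
mixed = ((("A1", ("A2", "B1")), "B2"), "C1")
print(is_monophyletic(mixed, species_of, "A"))  # False
```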

In the following example, by contrast, the sequences of A and B are not reciprocally monophyletic, so A and B cannot be considered true sister species:

[Figure: sister species 3]
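One way to turn the three examples into a rule: two species are direct sisters if a node has exactly two children, and each child clade contains all sequences of one species and nothing else. The sketch below is an interpretation under that assumption, not a specification. It again uses nested tuples for a rooted tree and a `species_of` mapping (both hypothetical), and every leaf must have an entry in `species_of`.

```python
def find_sister_species(tree, species_of):
    """Return a set of frozenset({species1, species2}) for direct sister species."""
    all_members = {}
    for seqid, sp in species_of.items():
        all_members.setdefault(sp, set()).add(seqid)

    sisters = set()

    def visit(node):
        # Post-order traversal that returns the set of leaf names below the node.
        if isinstance(node, str):
            return {node}
        child_leaves = [visit(child) for child in node]
        if len(child_leaves) == 2:
            complete = []
            for leaf_set in child_leaves:
                species = {species_of[seqid] for seqid in leaf_set}
                if len(species) == 1:
                    sp = next(iter(species))
                    # The clade must hold every sequence of that species.
                    if all_members[sp] == leaf_set:
                        complete.append(sp)
            if len(complete) == 2:
                sisters.add(frozenset(complete))
        return set().union(*child_leaves)

    visit(tree)
    return sisters


species_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B",
              "C1": "C", "C2": "C", "D1": "D"}

# A and B are sisters, and so are C and D.
tree = ((("A1", "A2"), ("B1", "B2")), (("C1", "C2"), "D1"))
print(find_sister_species(tree, species_of))

# A2 groups with B1, so A and B are not reciprocally monophyletic.
mixed = ((("A1", ("A2", "B1")), "B2"), (("C1", "C2"), "D1"))
print(find_sister_species(mixed, species_of))
```

In the second tree only the pair C and D is reported, because the sequences of A and B are intermingled.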

So here is what the program should do:

Lastly, here are some preliminary thoughts about how to calculate trees within TaxI (second level of complexity, see above):

necrosovereign commented 3 years ago

@mvences Could you provide sample input for this functionality?

necrosovereign commented 3 years ago

I also need a table of sequences that would be provided as input together with these trees.

mvences commented 3 years ago

OK, the following ZIP file includes:

The interpretation of the tree and sequences should be as in the following graphs. There are two pairs of sister species as recognizable from the trees. The genetic distance values from these two comparisons should be tabulated separately from the other inter-species values.

[Figure: sister species example]
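The separate tabulation could look roughly like the sketch below, which sorts pairwise distances into sister-species and other inter-species tables. The input shapes and names are assumptions: `distances` holds `(seqid1, seqid2, distance)` tuples, `species_of` maps seqid to species, and `sister_pairs` is a set of two-species frozensets. The additional intra-species table and the distance values are made up for illustration.

```python
def split_distances(distances, species_of, sister_pairs):
    """Sort pairwise distances into intra-species, sister and other tables."""
    tables = {"intra_species": [], "sister_species": [], "other_inter_species": []}
    for seq1, seq2, dist in distances:
        sp1, sp2 = species_of[seq1], species_of[seq2]
        if sp1 == sp2:
            key = "intra_species"
        elif frozenset((sp1, sp2)) in sister_pairs:
            key = "sister_species"
        else:
            key = "other_inter_species"
        tables[key].append((seq1, seq2, dist))
    return tables


species_of = {"A1": "A", "A2": "A", "B1": "B", "C1": "C"}
sister_pairs = {frozenset({"A", "B"})}
distances = [
    ("A1", "A2", 0.01),  # same species
    ("A1", "B1", 0.05),  # sister species
    ("A2", "B1", 0.06),  # sister species
    ("A1", "C1", 0.12),  # other inter-species comparison
]

tables = split_distances(distances, species_of, sister_pairs)
for name, rows in tables.items():
    print(name, rows)
```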