tbigot / tpms

Tree Pattern Matching Suite

rooting according to unicity scores : avoiding an arbitrary choice #7

Closed flassalle closed 12 years ago

flassalle commented 12 years ago

Hello Tom,

I told you there were issues when considering the results of tpms_computations using the "RU" (root considering unicity scores) command.

a summary of what tpms_computations does when choosing RU: the program intends to root the tree so as to minimize the sum of the unicity scores observed at each node. The unicity score is the log of the product of the number of times each species is observed in the subtree beneath the node, so it is 0 (the product is 1) when every species has at most 1 copy, and it increases with the number of copies per species; see a previous issue. This corresponds to minimizing the number of inferred duplications if one wanted to reconcile the gene tree by (Dollo's) parsimony, and it favors ancestral duplications rather than recent ones.
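To make the definition concrete, here is a minimal Python sketch of the score for a single node, given the leaves beneath it. It is not the tpms implementation: the function names are made up for illustration, and it assumes the species code is the part of the leaf name before the first underscore (for instance `ATU5A` in `ATU5A_1_PE601`). The leaf names are taken from the example tree in this issue.

```python
import math
from collections import Counter

def species_of(leaf_name):
    # Assumption: the species code is everything before the first "_".
    return leaf_name.split("_", 1)[0]

def unicity_score(leaf_names):
    """Log of the product of per-species copy counts in a subtree.

    The log of a product is the sum of the logs, so we sum log(count).
    """
    counts = Counter(species_of(name) for name in leaf_names)
    return sum(math.log(c) for c in counts.values())

# Every species present once: product is 1, score is 0.
single_copy = ["ATU7C_1_PE1302", "ATU7B_1_PE659", "ATU9A_1_PE592"]
print(unicity_score(single_copy))

# Adding two copies of ATU5A: product is 2, score is log(2).
with_duplicate = single_copy + ["ATU5A_1_PE558", "ATU5A_1_PE601"]
print(unicity_score(with_duplicate))
```

The second value, log(2), is the 0.693147 annotation carried by the node that groups both ATU5A copies in the annotated tree of this issue.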

but some solutions are equivalent: several roots can yield the same sum of scores, because wherever the duplicated leaves are placed in the tree, they produce the same incongruence and so the same unicity score.
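The tie can be reproduced on a toy case. The sketch below is only an illustration, not the tpms code: it tries every branch of a small, hypothetical unrooted tree as the root, sums the unicity scores of all internal nodes, and lists the branches that reach the minimum. The tree, the node names `x`, `y`, `z` and the helper names are invented, and the species code is again assumed to be the prefix before the first underscore.

```python
import math
from collections import Counter

def unicity_score(leaf_names):
    # Assumption: the species code is the part of the name before the first "_".
    counts = Counter(name.split("_", 1)[0] for name in leaf_names)
    return sum(math.log(c) for c in counts.values())

def collect(adj, node, parent, scores):
    """Return the leaves below `node` (seen from `parent`) and append the
    unicity score of every internal node met on the way to `scores`."""
    children = [n for n in adj[node] if n != parent]
    if not children:  # a leaf: it contributes no score of its own
        return [node]
    leaves = []
    for child in children:
        leaves += collect(adj, child, node, scores)
    scores.append(unicity_score(leaves))
    return leaves

def rooted_sum(adj, u, v):
    """Sum of unicity scores when the root is placed on branch u-v."""
    scores = []
    leaves = collect(adj, u, v, scores) + collect(adj, v, u, scores)
    scores.append(unicity_score(leaves))  # the new root node
    return sum(scores)

# Hypothetical unrooted tree: species A has two copies, A_1 and A_2.
edges = [("A_1", "x"), ("B_1", "x"), ("x", "y"), ("C_1", "y"),
         ("y", "z"), ("A_2", "z"), ("D_1", "z")]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

sums = {edge: rooted_sum(adj, *edge) for edge in edges}
best = min(sums.values())
ties = [edge for edge, s in sums.items() if math.isclose(s, best)]
print(ties)  # [('A_1', 'x'), ('x', 'y'), ('y', 'z'), ('A_2', 'z')]
```

In this toy tree, every branch on the path between the two copies of species A gives the same minimal sum, so the criterion alone cannot pick one of them.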

example: this tree was rooted according to unicity scores and then annotated with unicity scores:

((((([0]ATU7C_1_PE1302:1e-10,([0]ATU7B_1_PE659:1e-10,[0]ATU9A_1_PE592:1e-10)[0]0.0:1e-10)[0]0.757:0.0114439,[0]ATU5A_1_PE558:0.0234247)[0]0.679:0.0116375,([0]ATU2A_1_PE559:0.0114669,(([0]ATU1C_1_PE599:0.0113671,[0]ATU1A_1_PE2584:1e-10)[0]0.0:5.079e-07,[0]ATU1B_1_PE2604:1e-10)[0]0.772:0.0116533)[0]0.0:1.10769e-05)[0]:0.0921731,(([0]ATU5A_1_PE601:0.0736353,[0]ATU3A_1_PE535:1.0597e-06)[0]:0.0106629,[0]ATU4C_1_PE681:0.0459705)[0]:2.21558)[0.693147]:0.0894317,(((([0]ATU4B_1_PE229:1e-10,([0]ATU4A_1_PE659:1e-10,([0]ATU7C_1_PE1304:1e-10,([0]ATU7A_1_PE2695:1e-10,[0]ATU13_1_PE597:0.0115744)[0]0.0:5.793e-07)[0]0.0:6.248e-07)[0]0.0:4.687e-07)[0]0.0:4.687e-07,[0]ATU4C_1_PE746:1e-10)[0]0.0:6.709e-07,[0]ATU3A_1_PE644:0.0232356)[0]:0.036286,([0]ATU8A_1_PE550:1e-10,([0]ATU6A_1_PE2760:1e-10,[0]AGRT5_1_PE2271:1e-10)[0]0.0:1e-10)[0]:1.4957e-06)[0]:0.0894317)[2.77259];

in fact the "good" rooting should be :

(((ATU5A_1_PE601:0.0736353,ATU3A_1_PE535:1.0597E-6):0.0106629,ATU4C_1_PE681:0.0459705):1.10779,((((ATU7C_1_PE1302:1.0E-10,(ATU7B_1_PE659:1.0E-10,ATU9A_1_PE592:1.0E-10)0.0:1.0E-10)0.757:0.0114439,ATU5A_1_PE558:0.0234247)0.679:0.0116375,(ATU2A_1_PE559:0.0114669,((ATU1C_1_PE599:0.0113671,ATU1A_1_PE2584:1.0E-10)0.0:5.079E-7,ATU1B_1_PE2604:1.0E-10)0.772:0.0116533)0.0:1.10769E-5):0.0921731,((((ATU4B_1_PE229:1.0E-10,(ATU4A_1_PE659:1.0E-10,(ATU7C_1_PE1304:1.0E-10,(ATU7A_1_PE2695:1.0E-10,ATU13_1_PE597:0.0115744)0.0:5.793E-7)0.0:6.248E-7)0.0:4.687E-7)0.0:4.687E-7,ATU4C_1_PE746:1.0E-10)0.0:6.709E-7,ATU3A_1_PE644:0.0232356):0.036286,(ATU8A_1_PE550:1.0E-10,(ATU6A_1_PE2760:1.0E-10,AGRT5_1_PE2271:1.0E-10)0.0:1.0E-10):1.4957E-6):0.1788634):1.10779);

this one is equivalent with respect to the minimum sum of unicity scores criterion, as it also separates ATU5A_1_PE601, ATU3A_1_PE535 and ATU4C_1_PE681 from the other copies of ATU5A, ATU3A and ATU4C.

the arguments for calling it a good rooting are the following:

is it possible? it would be nice!

thank you for your support and patience in receiving my development queries ;)

Florent

flassalle commented 12 years ago

another case: a tree rooted with tpms_computations using the RU criterion:

((([0]RHISN_2_PE328:0.0780994,([0]SIMEL1_2_PE722:0.0309023,[0]SINMW_1_PE315:0.0743093)[0]0.911:0.052448)[0]0.998:0.175135,([0]MESSB_2_PE169:0.571942,(([0]ATU9A_1_FBPA:0.0105434,(([0]ATU6A_1_PE691:0.0179858,([0]AGRT5_1_PE397:0.00236918,[0]ATU8A_1_FBPA:0.00486144)[0]0.867:0.00821043)[0]0.977:0.0242271,(([0]ATU4B_1_FBPA:0.00493407,([0]ATU4C_1_FBPA:0.00489819,[0]ATU4A_1_FBPA:0.00245634)[0]0.884:0.00487167)[0]0.935:0.0133221,((([0]ATU3A_1_FBPA:0.0310498,[0]ATU13_1_PE2378:0.0236437)[0]0.908:0.0140602,([0]ATU5A_1_PE1496:0.0193196,([0]ATU1C_1_FBPA:0.00238446,([0]ATU1A_1_PE681:1e-10,[0]ATU1B_1_PE693:0.00239116)[0]0.0:1.084e-07)[0]0.996:0.0369119)[0]0.105:0.0115558)[0]0.98:0.0294683,(([0]ATU2A_1_FBPA:0.0455766,[0]ATU7C_1_PE696:0.00730757)[0]0.434:0.00245218,([0]ATU7A_1_PE693:0.00481653,[0]ATU7B_1_FBPA:0.00724624)[0]0.0:1.036e-07)[0]0.914:0.0108357)[0]0.809:0.00856339)[0]0.794:0.00637148)[0]:0.0115702)[0]:0.948443,[0]PARL1_1_PE177:1.06866)[0]:0.300389)[0]:0.208027)[0]0.986:0.194535,([0]ATU2A_1_PE2505:0.0319824,(([0]ATU4A_1_PE2085:0.00235353,([0]ATU4B_1_PE1929:0.00943543,[0]ATU4C_1_PE2545:1.179e-07)[0]0.8:0.00473636)[0]0.916:0.00740125,((([0]ATU7C_1_PE859:0.00479186,([0]ATU7A_1_PE858:0.00236456,([0]ATU7B_1_PE2366:0.00236553,[0]ATU9A_1_PE2243:0.0342792)[0]0.0:1.064e-07)[0]0.776:0.00232365)[0]0.87:0.00732756,([0]ATU3A_1_PE2540:0.0192171,(([0]ATU1B_1_PE909:9.4e-08,([0]ATU1C_1_PE2343:0.00471259,[0]ATU1A_1_PE888:1.637e-07)[0]0.897:0.0047143)[0]0.978:0.0230278,([0]ATU5A_1_PE1654:0.0151439,[0]ATU13_1_PE2578:0.0158961)[0]0.969:0.0209027)[0]0.991:0.0277479)[0]0.221:0.00739861)[0]0.748:0.00469122,([0]ATU6A_1_PE1023:0.0161468,([0]AGRT5_1_PE598:0.00973626,[0]ATU8A_1_PE2346:0.00211664)[0]0.968:0.0157425)[0]0.912:0.0114602)[0]0.384:0.00454628)[0]0.916:0.0364053)[0]1.0:0.194535)[11.7835];

in this particular case, choosing the longest branch (1.0687) would not be that bad, but I think the optimal choice is the 0.9484-long branch (not that different in length), as it yields a better distribution of taxa; I think that would be the root chosen with the RT criterion. Again, both solutions are ex aequo with the one proposed above for the RU criterion.

So I get to the point: can you implement a multi-criterion rooting program? I think the best would be the following:

this algorithm is deterministic, with only objective choices. I think it is better to run RU before RT because, as we mentioned previously, when testing all possible roots with RT, if the tree is big and the set of taxa is limited, you rapidly end up with the majority of nodes annotated with the taxonomic rank closest to the root of the species tree, and thus you get many ex aequo roots. Conversely, RU works well with trees that are big compared to the size of the set of species, because they show lots of duplication nodes. So if you first pre-select a set of good candidate roots with RU, you can then adjust the result for smaller subtrees with RT, for instance by bringing together clades that have been split by the root, or by re-ordering the nodes as in the species tree (the example above is a good illustration).
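The "RU first, then RT" ordering amounts to a lexicographic tie-break. A rough Python sketch of that idea follows; it only shows the filtering order, not the exact procedure proposed here. The function `select_roots`, the candidate records and all score values are hypothetical, and it assumes both criteria can be expressed as a number where lower is better.

```python
import math

def select_roots(candidates, criteria, tol=1e-9):
    """Keep the candidates that are best for the first criterion, then break
    the remaining ties with the next criterion, and so on.

    Each criterion is a function returning a score where lower is better
    (an assumption made for this sketch).
    """
    remaining = list(candidates)
    for score in criteria:
        best = min(score(c) for c in remaining)
        remaining = [c for c in remaining
                     if math.isclose(score(c), best, abs_tol=tol)]
        if len(remaining) == 1:
            break
    return remaining

# Hypothetical candidate branches with made-up scores.
candidates = [
    {"branch": "a", "ru": math.log(16), "rt": 5.0},
    {"branch": "b", "ru": math.log(16), "rt": 3.0},
    {"branch": "c", "ru": math.log(32), "rt": 1.0},
]

# RU pre-selects "a" and "b"; RT then separates them.
chosen = select_roots(candidates, [lambda c: c["ru"], lambda c: c["rt"]])
print([c["branch"] for c in chosen])  # ['b']
```

Branch "c" has the best RT value but is never considered, because it was already discarded by RU; this is the practical consequence of running RU before RT. If several candidates still tie after the last criterion, the function returns all of them.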

I'll let you think a bit before posting again... and if I get new ideas, I'll try to present them together ;)

Flo

tbigot commented 12 years ago

Thanks for your suggestion, I’m currently working on a mixed method. I’ll announce it as soon as it is ready.

tbigot commented 12 years ago

It works fine! Could you tell me if the rootings are consistent?

Please re-open this bug if it’s not the case.