pierrebarbera / epa-ng

Massively parallel phylogenetic placement of genetic sequences
GNU Affero General Public License v3.0
78 stars 7 forks source link

Zero-length branches altered #38

Open davidemms opened 3 years ago

davidemms commented 3 years ago

Hi

Firstly, thanks for developing such a great tool! I've had an odd issue occur whereby some zero-length branches in the reference tree get replaced with a branch length of 0.1053605157 in the epa_result.jplace file. I've attached an example that reproduces the issue: submit.tar.gz

The command I ran was:

epa-ng -T 4 -m LG --redo --tree test.tree.tre --ref-msa test.msa.ref.fa --query test.msa.query.fa --preserve-rooting on --outdir out

test.tree.tre was: "(g33013252_Mapoly0056s0034.1.p:0,((g33020677_Mapoly0173s0004.1.p:0,g33026704_Mapoly0052s0024.1.p:1.11896)100:0.662911,g33026973_Mapoly0016s0089.1.p:0.312835)1:0.234546);"

whereas the relevant line in epa_result.jplace was "tree": "(g33013252_Mapoly0056s0034.1.p:0.0000000000{0},((g33020677_Mapoly0173s0004.1.p:0.1053605157{1},g33026704_Mapoly0052s0024.1.p:1.1189600000{2})100:0.6629110000{3},g33026973_Mapoly0016s0089.1.p:0.3128350000{4})1:0.2345460000{5});",

Note the zero branch length for "g33013252_Mapoly0056s0034.1.p" and the non-zero branch for "g33020677_Mapoly0173s0004.1.p"

This also happens in a larger tree, with the same resulting branch-length, even from branches far away from where the gene was inserted. Many/all of the zero length branches were replaced with a branch length of the same value as in the example above, 0.1053605157.

All the best David

pierrebarbera commented 3 years ago

Hi David,

thank you for bringing this to my attention. So the bug here seems to be that not all branch lengths get set from 0 to that value.

Setting the value is more of a "feature", though admittedly not a great one. What should happen is that the program gives an error when it sees a ref tree with 0 branch lengths, as it makes very little sense from the point of view of the algorithm: how can it insert a query on a branch that doesn't have a length? It also will try to find the exact location along the branch, but how can it do that on such a branch?

How are you getting these 0 branches, does the tree come from an inference program like IQTREE/RAxML? What you may want to do is put the tree through raxml-ng --evaluate, which will give the tree sensible lengths (and also give you a model file that you can pass to -m instead of the default values used by -m LG in your call),

Cheers, Pierre

davidemms commented 3 years ago

Hi Pierre

Thanks for getting back to me on this. Just a question about the algorithm as I'm curious as to how it works. My understanding of a tree with a zero branch length would be, for example, a human and gorilla gene that have not had any sequence changes since they diverged. The terminal branch lengths would be zero in that case. For such a situation, the algorithm can't put the new query gene on that branch, as you say, but that's fine because there isn't any situation when it should anyway. In the situation when it too had the same sequence, then it should be inserted as a third child node of the human/gorilla parent node. There aren't any input sequences that make sense putting in 'along' the zero length branch.

I realise now that the example I submitted was a bit artificial as I hand edited the files to create a small example. I've attached the actual tree and MSA for which I observed the problem. This shows the case where there are identical sequences: OG0000320.zip

In answer to your question, yes, this tree comes from IQTREE. It definitely sounds like your suggestion of using "raxml-ng --evaluate" could resolve this, but I'd be curious to know what it would do differently here, it seems like the zero branch lengths are the correct answer?

Cheers David

lczech commented 3 years ago

Hi David,

you are right with your human and gorilla example: for identical sequences, zero branch lengths of course make sense, and placing additional sequences on those branches doesn't, but instead place them on the parent branch.

So, what to do? My take on this is that there seem to be two perspectives here: Biologically, it does make sense to think of identical sequences as having zero length branches. Algorithmically on the other hand, we usually want to define tree inference (and, by extension, phylogenetic placement) as a problem on a set of distinct sequences. Admitted, in real world data (such as yours, apparently), that is not always the case, and I'd argue that it might make sense to develop software that can handle these cases.

Hence, I see two ways forward here: Work with a set of distinct reference sequences (i.e., delete duplicates before tree inference), or maybe it makes sense to extend EPA to work with these cases, but @Pbdas has a better grasp on whether that makes sense in this particular case, so let's wait for his take on this.

Hope that I understood your questions correctly and that this helped clarify it a little bit.

Cheers and so long Lucas

pierrebarbera commented 3 years ago

Hi David,

in your described scenario there would be two identically likely placements at the Human and Gorilla terminal branches, since there is no signal to distinguish between them. Thats purely a result of 1) placement always happening on branches and 2) the math behind the likelihood calculation (which works out to the same number since the sequences and branch lengths are identical).

I think the possible solution that Lucas notes (placing on the parent instead) would be a decent compromise, and the straight-forward way to implement it would be to ignore all branches that are 0 length.

Still, 0 length branches indicate a problematic reference tree to me since I don't see the value of having a reference tree/alignment like that. Usually the goal is to include disparate reference sequences, such that we can see if the query sequence belongs to either one or the other. If the reference sequences are identical, there is no way a query could ever tell you if it belongs to one over the other. I looked at your alignment and ~100 out of ~400 references are identical, so at a minimum I would trim those out. Is there any reason from your side not to do that?

If you decide to go that route, also make sure to pass the actual model parameters to EPA-ng, which can be done either as I mentioned by using raxml-ng (giving you a modelfile), or by passing the IQTREE output log file that lists them (look for SUBSTITUTION PROCESS). These are incredibly important, to the point that I should really just disallow the default model settings...

Let me know what you end up doing, and keep asking questions when you have them! Pierre

davidemms commented 3 years ago

Hi Pierre

Thanks for your reply. For me this comes about from a very large scale analysis. It's in support of this website: https://shoot.bio & preprint: https://doi.org/10.1101/2021.09.01.458564. I found your tool to be a faster alternative to IQTREE for adding a new gene to an existing tree and so it's great from a user's point of view.

The database I constructed contains gene trees for all gene families in the included species. This means that there are some cases, like the example above, where there are identical genes within a family. This particular example tree is an extreme case. For any tree though, I'd like to return this gene tree to the user showing that these identical genes exist within the family as that is important contextual information about the gene family. Especially if some of those identical genes are orthologs.

If you are able to make a change to the code to accommodate this then that would be really great. The alternative work around for me would be:

  1. Remove any identical sequences from my reference tree and MSA, leaving one example for each case
  2. Check there are no polytomies remaining in the tree and if there are then spit them arbitrarily (this is the step I currently do).
  3. Run EPA-ng
  4. Read the resulting tree programmatically, identify the position of the new gene, graft the gene into the corresponding position in the original tree which contains the identical sequences.

I can do this, but I thought I would check with you first in case you were able to support such trees instead.

All the best David

davidemms commented 3 years ago

For the model, is EPA-ng able to read the IQTREE log file to identify the model used? And are there some IQTREE models it doesn't support, I think I might have had problems with JTTDCMut and mtInv? I will make an update to pass the correct model for each tree now that I have this info (in the first pass I was using FastTree to test feasibility so hadn't done a model test).

I think "raxml-ng --evaluate" is probably not feasible for me. I have about 60,000 gene trees with around 5 million genes. Some of the gene trees are particularly large and the whole dataset took months using 4 high powered servers with IQTREE.

pierrebarbera commented 3 years ago

EPA-ng can parse IQTREE log files yes, but I made that more of a hidden feature since I'm pretty sure (as you say) I don't/maybe can't support all models coming from there. Worth trying though!

As for the feature, I think it makes sense, that or allowing placement on 0 length branches and just not performing optimization. However I don't know when I'll get to this, as I will be busy for another month minimum before I can start working on the EPA-ng backlog... maybe I can find some time on a weekend, but I can't promise much. Part of that is my stubbornness with not wanting to add feature before finally merging a larger memory management feature in, which has proven a bit of a dependency nightmare.

I'll post an update here if something moves!