Open kawabata-tomoko opened 1 year ago
Dear Team, I forgot that the tree was built from protein sequences (GTDB markers), while the sequences I need to insert are 16S V4 region sequences, which contain many duplicates. It looks like the second problem may be caused by mismatched reference trees and MSA files. Some genomes have identical 16S V4 region sequences but sit at different places in the protein tree. But I still don't know why I didn't get any error messages when testing on Linux. Best wishes.
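The duplication mentioned above can be checked directly on the reference MSA before any placement is attempted. The following is a minimal sketch, not part of EPA-ng; the helper names `read_fasta` and `duplicate_groups` and the example genome names are hypothetical, and the parser assumes plain FASTA with gaps written as `-`.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-separated token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def duplicate_groups(records):
    """Return sorted name lists for sequences that are identical once gaps are removed."""
    by_seq = defaultdict(list)
    for name, seq in records.items():
        key = seq.replace("-", "").upper()
        by_seq[key].append(name)
    return [sorted(names) for names in by_seq.values() if len(names) > 1]


# Hypothetical toy alignment: genome_A and genome_B differ only by a gap.
example = """>genome_A
TACGAAGG-GGCTAGC
>genome_B
TACGAAGGGGCTAGC
>genome_C
TACGTAGGGGCTAGC
"""

groups = duplicate_groups(read_fasta(example))
print(groups)
```

Each reported group lists reference names whose ungapped sequences are indistinguishable, which is the situation described above for genomes that share a 16S V4 sequence but sit in different parts of the protein tree.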
Hey @kawabata-tomoko,
hm, I am a bit confused - let's try to clarify. For your reference sequences, you built an amino acid MSA, and inferred a tree from that, using a protein model (--model LG+F+G4). Then, you used that MSA to align your query sequences with MAFFT, but your sequences seem to be in nucleotide space instead:
>db29dfd5db5e2501ed9deadabc7dd91d
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGGACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTCCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCTCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
Or do you have a second MSA for the same sequences, but containing their nucleotides instead of their amino acids, and aligned to that?
Either way, using the protein model in EPA-ng when your query sequences (and the MSA?) are actually nucleotides seems like it could cause an error such as Tree Log-Likelihood -INF.
Please check that you are using compatible data and models :-)
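One way to run such a compatibility check is sketched below. It is a rough heuristic, assuming that a sequence made up almost entirely of A, C, G, T, U and N is nucleotide data. The function names and the short model lists are illustrative assumptions and not an EPA-ng API; the lists cover only a few common model names.

```python
# Illustrative, deliberately incomplete lists of substitution model base names.
PROTEIN_MODELS = {"LG", "WAG", "JTT", "DAYHOFF", "BLOSUM62"}
DNA_MODELS = {"GTR", "JC", "HKY", "K80", "F81"}


def guess_alphabet(sequence, threshold=0.9):
    """Guess 'dna' or 'protein' from the fraction of nucleotide-like letters."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    nucleotide_like = sum(c in "ACGTUN" for c in residues)
    return "dna" if nucleotide_like / len(residues) >= threshold else "protein"


def model_alphabet(model):
    """Map a model string such as 'LG+F+G4' to 'dna', 'protein', or 'unknown'."""
    base = model.split("+")[0].upper()
    if base in PROTEIN_MODELS:
        return "protein"
    if base in DNA_MODELS:
        return "dna"
    return "unknown"


def is_compatible(model, sequence):
    """True if the model's alphabet matches the guessed alphabet of the sequence."""
    return model_alphabet(model) == guess_alphabet(sequence)


# Start of the query sequence quoted above, checked against both model types.
query = "TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCG"
print(guess_alphabet(query))
print(is_compatible("LG+F+G4", query))
print(is_compatible("GTR+G", query))
```

A mismatch here does not prove that it is the cause of the crash, but it flags the protein model with nucleotide query combination discussed above.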
As for "But I still don't know why I didn't get any error messages when testing on Linux.":
That is indeed a bit weird - the error message should still be printed, as you got on MacOS. Not sure what's going on there. It could be that EPA-ng was compiled with some older gcc - how did you get or compile EPA-ng? From conda, or did you compile on your own? If the latter, which compiler did you use?
Cheers and so long Lucas
Dear @lczech ,
Thank you for your reply. I apologize for the vagueness of my previous description of the problem. Let me clarify what I did in my workflow.
I used both amino acid and nucleic acid sequences from the same genomes. For each genome, I constructed a reference tree using its protein MSA, and then I used its DNA MSA (with the same name) as a reference MSA for the subsequent sequence insertion process.
To summarize: For each reference genome group:
The DNA sequence that I used for the insertion was not the corresponding nucleic acid sequence of the protein, but another conserved gene from the same genome, which I expected to have the same phylogenetic characteristics. You mentioned that using incorrect models in EPA-ng might cause the errors that I encountered, but the tree itself was built based on protein sequences. Should I set the model to be consistent with the actual tree or should I change the MSA type input during the insertion process? Maybe these operations sound ‘illegal’, but the actual requirement that we have in our research is to insert DNA sequences into phylogenetic trees built from proteins. Would this problem be avoided if I used the corresponding nucleic acid sequence instead of the amino acid sequence to build the initial tree?
As for how EPA-ng was obtained: on both Linux and macOS it was installed with the conda install -c bioconda epa-ng command, so I did not compile it myself.
Thank you for your time and assistance.
Sincerely, Tomoko
Hi @kawabata-tomoko,
thanks, that makes more sense now :-)
"Should I set the model to be consistent with the actual tree or should I change the MSA type input during the insertion process? Maybe these operations sound ‘illegal’, but the actual requirement that we have in our research is to insert DNA sequences into phylogenetic trees built from proteins. Would this problem be avoided if I used the corresponding nucleic acid sequence instead of the amino acid sequence to build the initial tree?"
It is totally fine to use a tree built from protein sequences, if that more accurately reflects the phylogenetic relationship of your data (there is always that issue with gene tree vs species tree...). However, you will still need to use model parameters that are optimized for the type of MSA and query sequence data that you are going to place. See here for how to do that.
In short, in your case, you want to get the model parameters for a DNA model (of your choice, but GTR+G is usually a good starting point) optimized for your tree and placement MSA. The fact that the tree was originally inferred from protein does not matter at this point. But you cannot use the protein model to place DNA sequences.
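A minimal sketch of that two-step workflow is given below, assuming RAxML-ng is used to optimize the DNA model parameters on the fixed tree and that its output file `<prefix>.raxml.bestModel` is then passed to EPA-ng. The file names `reference.nwk`, `reference_dna.fasta` and `query_aligned.fasta` and the helper `placement_commands` are hypothetical placeholders. The code only assembles the command lines and does not run them.

```python
import shlex


def placement_commands(tree, ref_msa, query_msa, model="GTR+G", prefix="info"):
    """Build the RAxML-ng evaluation and EPA-ng placement command lines."""
    # Step 1: optimize DNA model parameters and branch lengths on the given tree.
    evaluate = [
        "raxml-ng", "--evaluate",
        "--msa", ref_msa,
        "--tree", tree,
        "--model", model,
        "--prefix", prefix,
    ]
    # Step 2: place the aligned queries using the optimized model file from step 1.
    place = [
        "epa-ng",
        "--tree", tree,
        "--ref-msa", ref_msa,
        "--query", query_msa,
        "--model", f"{prefix}.raxml.bestModel",
    ]
    return evaluate, place


# Hypothetical file names; replace with the DNA reference MSA and aligned queries.
evaluate_cmd, place_cmd = placement_commands(
    "reference.nwk", "reference_dna.fasta", "query_aligned.fasta"
)
print(shlex.join(evaluate_cmd))
print(shlex.join(place_cmd))
```

To execute the steps rather than print them, each list could be passed to `subprocess.run(cmd, check=True)`. The tip names in the tree must match the sequence names in the DNA reference MSA for either tool to accept the input.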
Let me know if that helps and solves the issue :-)
Cheers and so long Lucas
Hello Team, when I try to place a sequence into the tree on Linux, I get the error Aborted (core dumped) without any further error information. I tried the same data and command on macOS, and it core dumped again (with libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!). The EPA-ng version is EPA-ng v0.3.8.
The command used:
System info:
Run details (macOS):
The query file (same length as the MSA sequences after alignment with MAFFT):
It seems like something is wrong with my tree or query: when I placed a different query into a different tree, everything worked. The tree used in the failing case was built by iqtree2 with the command
and the tree is:
After getting an error with this tree, I noticed that most nodes have zero confidence. I removed the support values, but the error remained. I then built a tree with FastTree using fasttree -intree genus_Acetobacter.treefile ../seq/all_seq.fasta > fast.tre (and converted it to a binary tree with ete3: tree.resolve_polytomy()), but still got the error. What can I do to solve this problem? I look forward to your reply. Thanks a lot. The MSA file is uploaded as an attachment: max_16s_ref.txt