pierrebarbera / epa-ng

Massively parallel phylogenetic placement of genetic sequences
GNU Affero General Public License v3.0

Tree Log-Likelihood -INF Error #50

Open kawabata-tomoko opened 1 year ago

kawabata-tomoko commented 1 year ago

Hello Team, when I try to place a sequence into the tree on Linux, I get the error Aborted (core dumped) without any further error information. I tried the same data and command on MacOS, and it dumped core again (with libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!). The EPA-ng version is v0.3.8. The command used:

epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query query_aligned.fasta --model LG+F+G4 --outdir .

System info:

Linux:     #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
MacOS:     20.6.0 Darwin Kernel Version 20.6.0: Fri Dec 16 00:35:00 PST 2022; root:xnu-7195.141.49~1/RELEASE_X86_64 x86_64

Run details (MacOS):

% epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query query_aligned.fasta --model LG+F+G4 --outdir .
INFO Selected: Output dir: ./
INFO Selected: Query file: query_aligned.fasta
INFO Selected: Tree file: refer_tree.tree
INFO Selected: Reference MSA: reference_v4.mafft
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: LG+F+G4
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/ (v0.3.8)
WARN The reference MSA and tree have differing number of taxa! 196 vs. 186
INFO Using model parameters:
INFO    Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 1 (ML),  weights&rates: (0.25,0.136954) (0.25,0.476752) (0.25,1) (0.25,2.38629) 
        Base frequencies (empirical): 0.250946 0 0 0 0.176994 0 0 0.367461 0 0 0 0 0 0 0 0 0.204599 0 0 0 
        Substitution rates (model): 0.425093 0.276818 0.395144 2.48908 0.969894 1.03855 2.06604 0.358858 0.14983 0.395337 0.536518 1.12403 0.253701 1.17765 4.72718 2.1395 0.180717 0.218959 2.54787 0.751878 0.123954 0.534551 2.80791 0.36397 0.390192 2.4266 0.126991 0.301848 6.32607 0.484133 0.052722 0.332533 0.858151 0.578987 0.593607 0.31444 0.170887 5.07615 0.528768 1.69575 0.541712 1.43765 4.50924 0.191503 0.068427 2.14508 0.371004 0.089525 0.161787 4.00836 2.00068 0.045376 0.612025 0.083688 0.062556 0.523386 5.24387 0.844926 0.927114 0.01069 0.015076 0.282959 0.025548 0.017416 0.394456 1.24028 0.42586 0.02989 0.135107 0.037967 0.084808 0.003499 0.569265 0.640543 0.320627 0.594007 0.013266 0.89368 1.10525 0.075382 2.78448 1.14348 0.670128 1.16553 1.95929 4.12859 0.267959 4.81351 0.072854 0.582457 3.23429 1.67257 0.035855 0.624294 1.22383 1.08014 0.236199 0.257336 0.210332 0.348847 0.423881 0.044265 0.069673 1.80718 0.173735 0.018811 0.419409 0.611973 0.604545 0.077852 0.120037 0.245034 0.311484 0.008705 0.044261 0.296636 0.139538 0.089586 0.196961 1.73999 0.129836 0.268491 0.054679 0.076701 0.108882 0.366317 0.697264 0.442472 0.682139 0.508851 0.990012 0.584262 0.597054 5.30683 0.119013 4.14507 0.159069 4.27361 1.11273 0.078281 0.064105 1.03374 0.11166 0.232523 10.6491 0.1375 6.31236 2.59269 0.24906 0.182287 0.302936 0.619632 0.299648 1.70274 0.656604 0.023918 0.390322 0.748683 1.13686 0.049906 0.131932 0.185202 1.79885 0.099849 0.34696 2.02037 0.696175 0.481306 1.89872 0.094464 0.361819 0.165001 2.45712 7.8039 0.654683 1.33813 0.571468 0.095131 0.089613 0.296501 6.47228 0.248862 0.400547 0.098369 0.140825 0.245841 2.18816 3.15182 0.18951 0.249313
INFO Output file: ./epa_result.jplace
libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!
zsh: abort      epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query  --model   .

The query file (it has the same length as the MSA sequences after alignment with MAFFT):

>db29dfd5db5e2501ed9deadabc7dd91d
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGGACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTCCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCTCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

It seems like something is wrong with my tree or query: when I placed a different query into a different tree, everything was OK. The tree used in the failing run was built by iqtree2 with the command

iqtree2 -s seq/all_seq.fasta -m LG+F+G4 -pre 'tree/%s_%s' -nt 48 --fast -alrt 1000

and the tree is:

(GCF_001581085.1:0.0001883592,(((((((((((GCF_002153675.1:0.6764403004,GCF_002153735.1:0.0030645544)0:0.0000029411,GCF_002153795.1:0.0158138579)0:0.6605801847,(((((((GCF_008690765.1:0.0000000000,GCF_008690505.1:0.0000000000):0.0000000000,GCF_008690645.1:0.0000000000):0.0000000000,GCF_008690705.1:0.0000000000):0.0000000000,GCF_008690905.1:0.0000000000):0.0000000000,GCF_008690915.1:0.0000000000):0.0000000000,GCF_008704295.1:0.0000000000):0.0416326726,GCF_024158225.1:0.0053982956)0:0.0000024925)0:0.0000024344,((((((((GCF_008690365.1:0.4406791409,GCF_003323795.1:0.4617060122)0:0.1640898965,GCF_024329695.1:0.0262970073)0:0.0003373681,(((GCF_000010845.1:0.6138629501,(GCF_003391275.1:1.6768345925,GCF_014196315.1:0.0139330781)0:0.0000025327)0:0.0000021894,GCF_008704325.1:0.0375132164)0:0.0000029909,((GCF_002153605.1:0.0134270192,GCF_001499615.1:0.4663964356)0:0.0000028317,GCF_008689845.1:0.4877107087)0:0.0033911496)0:0.0000024453)0:0.0000480426,GCF_001953595.1:1.7161662788)0:0.0000026774,(((((GCF_018811985.1:1.2476321189,GCF_001580535.1:0.3701136010)0:0.0000021507,GCF_002153545.1:1.3085919649)0:0.3109927057,GCF_000010905.1:0.0137330848)0:0.0000047088,GCF_001580945.1:0.6034760600)0:0.0000022867,GCF_024158385.1:0.4899289062)0:0.0028710588)0:0.0000028918,GCF_002153745.1:1.6808551804)0:0.0001082810,(GCF_003850905.1:0.0032674677,(((GCF_003966365.1:0.0147410702,GCF_021284605.1:0.5985121941)0:0.0000028764,GCF_001580695.1:0.0350269438)0:0.0005460484,GCF_000613865.1:0.0047823241)0:0.0000164957)0:0.0232318069)0:0.0054273270,(((((((GCF_002153695.1:1.4509147497,(GCF_024158305.1:0.0514046837,(GCF_008365315.1:0.4543791870,(GCF_001642635.1:0.0922419883,GCF_001581035.1:1.6490729713)0:0.0000029322)0:0.0000027350)0:0.1683579951)0:0.0124108208,GCF_002202135.1:0.3419663450)0:0.2244853940,((((GCF_002153775.1:0.2612440122,((GCF_022130865.1:0.4022580284,GCF_024158475.1:1.3897862725)0:0.0247699162,GCF_000379545.1:0.2593021461)0:0.0000027277)100:0.3193728053,(GCF_004341595.1:0.0609295276,(GCF_003850
825.1:0.0112367348,GCF_004014775.2:0.4916741059)0:0.0173493126)0:0.0007038507)0:0.0000021411,(GCF_008690725.1:0.4285462033,GCF_011516935.1:1.4206867824)0:0.1558481053)0:0.0000955949,(((((((GCF_001662905.1:0.4844812877,GCF_002005445.1:0.3366328551)0:0.0317392220,GCF_011516945.1:0.2458425154)0:0.2026757393,((GCF_022130785.1:0.0322073465,GCF_000963945.1:0.5333918479)0:0.0000022699,GCF_000193495.2:1.0022380453)0:0.0000024617)0:0.0045456014,GCF_018256985.1:0.0114794035)0:0.0000200495,GCF_024158185.1:0.7045870891)0:0.0000023903,(GCF_000285315.1:0.5450676598,GCF_008689805.1:0.1960126694)0:0.4423247183)0:0.0000948410,GCF_018811975.1:0.0090654175)0:0.0002828809)0:0.0010058575)0:0.0004283380,(((((((GCF_000963965.1:0.0060146282,((GCF_001580915.1:0.0060752079,GCF_007989335.1:0.6741506451)0:0.0000024009,(GCF_000964205.1:0.2689952602,GCF_001766235.1:0.5150130446)0:0.1617072119)0:0.0000021130)0:0.2398539718,GCF_006539325.1:0.2143134267)0:0.0203072357,GCF_001499675.1:0.5146329584)0:0.0000027795,GCF_008689865.1:0.1825394650)0:0.4026651290,GCF_024158265.1:0.0044401363)0:0.0000028173,((GCF_024158205.1:0.0042233156,GCF_011516875.1:0.5034271003)0:0.0000792547,GCF_000285275.1:1.6548804506)0:0.0000020615)0:0.0273314929,(((GCF_024158315.1:0.0070175453,(GCF_001183745.1:0.0063524715,(GCF_000787635.2:0.0065526170,GCF_002276785.1:0.4184158280)0:0.0000028871)0:0.5991026047)0:0.0000022640,(GCF_024158365.1:0.2715677289,GCF_013307325.1:1.3479024265)0:0.3013223189)0:0.0112147574,GCF_002220195.1:0.4824377378)0:0.0000023096)0:0.0000024257)0:0.0000021468,((((((GCF_007989245.1:0.0332987392,((GCF_008704245.1:0.0000000000,GCF_000755665.1:0.0000000000):0.0210429353,GCF_000963925.1:0.4718752717)0:0.0021558859)0:0.0078774398,((GCF_000241585.2:0.0088264104,GCF_021961645.1:0.6286745465)0:0.0000022309,GCF_008704305.1:0.6437612063)0:0.0002032385)0:0.0000020520,GCF_003850885.1:0.5811558117)0:0.0000020541,(((GCF_001580615.1:0.0310270635,GCF_017377745.1:0.0083640209)0:0.0000025252,((GCF_007991075.1:1.1324145009,(G
CF_024158505.1:0.2977152696,GCF_003850845.1:1.2075012951)0:0.0000020645)0:0.3300841784,GCF_000755675.1:0.0192709240)0:0.0000022336)0:0.0003230373,GCF_000964225.1:0.6464079685)0:0.0000023058)0:0.0000024507,((((((GCF_000010925.1:0.6580623757,(GCF_018256865.1:0.0729211723,GCF_017377715.1:1.6458223012)0:0.0000028144)0:0.0028590724,(GCF_024158445.1:0.5296455512,GCF_024158425.1:0.1631401068)0:0.0051417052)0:0.0000023336,(GCF_001580995.1:0.0095597996,GCF_011516835.1:0.0096655497)0:0.2897252936)0:0.0000021271,GCF_018256955.1:0.1057967928)0:0.4949788116,(GCF_002173775.1:0.0033052381,GCF_014132135.1:0.0135786395)0:0.0220684424)0:0.0027175137,((GCF_008704285.1:0.5204516769,(GCF_003850945.1:0.0142766333,GCF_002554745.1:1.6439203740)0:0.0000010000)0:0.0000023521,GCF_001581075.1:1.6439085500)0:0.0018296433)0:0.0032985240)0:0.0000024196,GCF_011516735.1:0.0311539844)0:0.0007464121)0:0.0000021260,((((GCF_018256895.1:0.0238122931,GCF_002276555.1:0.3567056482)0:0.0006964877,(((((GCF_022130905.1:0.6759780445,GCF_003850805.1:0.0684026210)0:0.0031813183,(((GCF_006539345.1:0.0117079797,GCF_000429165.1:0.0027619486)0:0.4105163026,GCF_011516925.1:0.2211171337)0:0.0987070937,GCF_009914215.1:0.0889556948)0:0.0000024406)0:0.0002617895,(GCF_024158285.1:0.5527586906,GCF_011516755.1:0.0753710884)0:0.0000020481)0:0.1569666418,GCF_022130805.1:0.6070661649)0:0.1229730883,GCF_002549835.1:0.0191286796)0:0.0000027567)0:0.0000027920,(((GCF_021961685.1:0.4339536548,GCF_000963905.1:0.5731901648)0:0.1371761622,(GCF_018256915.1:0.0068522593,GCF_007991375.1:1.6104189388)0:0.0002093233)0:0.0000025380,GCF_011516865.1:0.5911367130)0:0.0002295569)0:0.0000027232,((GCF_002153475.1:0.3833738686,GCF_000613905.1:0.0057836498)0:0.0000020031,(GCF_002456135.1:0.0089582661,GCF_007989285.1:0.6700584241)0:0.0161515351)0:0.0002918145)0:0.5686823855)0:0.0000025291,(((((((GCF_007989305.1:0.1238860300,GCF_019599335.1:0.2119236220)0:0.2066203913,(((((GCF_000010825.1:0.0000000000,GCF_000010965.1:0.0000000000):0.0000000000,GCF_00
0010945.1:0.0000000000):0.0000000000,GCF_000010885.1:0.0000000000):0.0000000000,GCF_000010865.1:0.0000000000):0.0814400697,GCF_008689815.1:0.0319721336)0:0.0035268865)0:0.0000021231,GCF_018256975.1:0.6696643419)0:0.0000027511,((((GCF_007991395.1:0.0000000000,GCF_011516825.1:0.0000000000):0.6003152616,GCF_011516885.1:0.0099408109)0:0.0000020243,GCF_002173735.1:0.0370744835)0:0.0019782092,(((GCF_014486685.1:1.5739864716,GCF_011516655.1:0.0221378734)0:0.0000020155,(GCF_002358055.1:0.4205428001,(GCF_001581105.1:0.0591542756,(GCF_002276805.1:0.2130518592,GCF_011516765.1:1.4222825718)0:0.1270049725)0:0.0012018872)0:0.0357890234)0:0.0000120705,((GCF_008704255.1:0.6340700535,GCF_011516725.1:0.0244629949)0:0.0000029892,GCF_002153485.1:1.6164220977)0:0.0000022480)0:0.0001081203)0:0.0095606788)0:0.2193311916,(((GCF_000241625.1:0.6839073885,GCF_018256935.1:0.1832314634)0:0.0067353401,(GCF_018256835.1:0.2030163707,GCF_001581005.1:0.3323390344)0:0.0000029143)0:0.0000027006,GCF_002153575.1:1.3448122176)0:0.0000028216)0:0.3003791125,(GCF_000723785.2:0.0082027889,GCF_014218315.1:0.5230507614)0:0.0000628853)0:0.0000020699,((GCF_008689795.1:0.6260611411,GCF_002723895.1:0.2262062301)0:0.1850061244,(((GCF_018256855.1:0.6672368453,GCF_000193245.1:0.0006517461)0:0.0000026401,(GCF_002156945.1:0.4738437206,GCF_011516745.1:0.0001157176)0:0.0000024761)0:0.0483889382,GCF_002153685.1:0.0741115710)0:0.2248808694)0:0.1757701247)0:0.0000021012)0:0.0000028596)0:0.0029442856)0:0.0000026584,(GCF_024158235.1:0.0241025345,GCF_022130845.1:0.0059523180)0:0.0005193121)0:0.0000021230,GCF_001766255.1:0.5120604596)0:0.0000020516,(((GCF_017377735.1:0.0621445686,GCF_000225485.1:0.3345182353)0:0.0027563893,(GCF_024158325.1:0.6234262367,(GCF_014207635.1:0.0585282647,GCF_003850965.1:0.5727943103)0:0.0000020369)0:0.0026639284)0:0.0000024753,GCF_002153515.1:0.0780511343)0:0.5342060505)0:0.0034632152,(((GCF_019083805.1:0.0001091116,GCF_003850865.1:0.0000540390)0:0.0018101296,GCF_003850925.1:0.4782026931)0:0.00000230
29,GCF_000613285.1:0.6558629151)0:0.0000025192)0:0.0000026868,GCF_024158405.1:1.0557610867)0:0.0000020045,GCF_001628715.1:0.0019204407)0:0.4800599364,GCF_002153655.1:0.1172978701)0:0.2140405767,((GCF_002006565.1:0.0894455164,GCF_009295745.1:0.0927561460)0:0.2385422753,GCF_002738225.1:0.0099295595)0:0.0000010000);

After getting the error with this tree, I noticed that most nodes have zero support. I removed the support values and still got the error. I also built a tree with FastTree using fasttree -intree genus_Acetobacter.treefile ../seq/all_seq.fasta > fast.tre (and converted it to a binary tree with ete3: tree.resolve_polytomy()), and still got the error. What can I do to solve this problem? I look forward to your reply. Thanks a lot.

The MSA file is uploaded as an attachment: max_16s_ref.txt

kawabata-tomoko commented 1 year ago

Dear Team, I forgot to mention that the tree was built from protein sequences (GTDB markers), but the sequences I need to insert are 16S V4 region sequences, which contain many duplicates. It looks like the second problem may be caused by a mismatch between the reference tree and the MSA file. Some genomes have identical 16S V4 region sequences but sit at different places in the protein tree. But I still don't know why I didn't get any error messages when testing on Linux. Best wishes.
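The taxa mismatch reported in the log (WARN: 196 vs. 186) and the duplicated 16S sequences mentioned here can both be inspected before running a placement. The following is a minimal sketch using only the Python standard library, not part of EPA-ng. The helper names are made up for illustration, and the leaf-name regex assumes unquoted Newick labels like the GCF_... names in this tree.

```python
import re
from collections import defaultdict


def newick_leaf_names(newick):
    """Return leaf labels of a Newick string (assumes unquoted labels).

    A leaf label directly follows '(' or ','; internal node labels and
    support values follow ')' and are therefore not matched.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def fasta_records(text):
    """Parse FASTA text into {name: sequence}, name = first word of header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def compare_taxa(newick, fasta_text):
    """Return (names only in the tree, names only in the MSA), both sorted."""
    tree = set(newick_leaf_names(newick))
    msa = set(fasta_records(fasta_text))
    return sorted(tree - msa), sorted(msa - tree)


def identical_sequence_groups(fasta_text):
    """Group taxa whose aligned sequences are exactly identical."""
    groups = defaultdict(list)
    for name, seq in fasta_records(fasta_text).items():
        groups[seq.upper()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling compare_taxa on the contents of refer_tree.tree and reference_v4.mafft would list which names account for the 196 vs. 186 difference, and identical_sequence_groups would show which genomes share the same 16S V4 sequence.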

lczech commented 1 year ago

Hey @kawabata-tomoko,

hm, I am a bit confused - let's try to clarify. For your reference sequences, you built an amino acid MSA, and inferred a tree from that, using a protein model (--model LG+F+G4). Then, you used that MSA to align your query sequences with MAFFT, but your sequences seem to be in nucleotide space instead:

>db29dfd5db5e2501ed9deadabc7dd91d
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGGACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTCCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCTCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

Or do you have a second MSA for the same sequences, but containing their nucleotides instead of their amino acids, and aligned to that?

Either way, using the protein model in EPA-ng, when your query sequences (and the MSA?) are actually nucleotides seems like it could cause an error such as Tree Log-Likelihood -INF.

Please check that you are using compatible data and models :-)
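One way to check this is to look at which characters the alignment files actually contain. The log in the first post already hints at the mismatch: only four of the twenty empirical frequencies are nonzero, and, assuming the usual A R N D C Q E G H I L K M F P S T W Y V state order, those four slots are A, C, G and T. The following is a rough heuristic sketch, not something EPA-ng provides. The function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, name = first word of header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def guess_alphabet(seqs, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the character composition.

    Gap and missing-data characters are ignored. If at least `threshold`
    of the remaining characters are A, C, G, T, U or N, the data is
    treated as nucleotide. This is only a heuristic: a short protein made
    mostly of A, C, G and T residues would be misclassified.
    """
    counts = Counter()
    for seq in seqs:
        counts.update(c for c in seq.upper() if c not in "-.?")
    total = sum(counts.values())
    if total == 0:
        return "empty"
    nucleotide = sum(counts[c] for c in "ACGTUN")
    return "nucleotide" if nucleotide / total >= threshold else "protein"
```

Running guess_alphabet on the sequences of both reference_v4.mafft and query_aligned.fasta, and comparing the result against the model passed via --model, would flag a DNA alignment being evaluated under a protein model such as LG+F+G4.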

As for

But I still don't know why I didn't get any error messages when testing on Linux.

That is indeed a bit weird - the error message should still be printed, as you got on MacOS. Not sure what's going on there. It could be that EPA-ng was compiled with some older gcc - how did you get or compile EPA-ng? From conda, or did you compile on your own? If the latter, which compiler did you use?

Cheers and so long Lucas

kawabata-tomoko commented 1 year ago

Dear @lczech ,

Thank you for your reply. I apologize for the vagueness of my previous description of the problem. Let me clarify what I did in my workflow.

I used both amino acid and nucleic acid sequences from the same genomes. For each genome, I constructed a reference tree using its protein MSA, and then I used its DNA MSA (with the same name) as a reference MSA for the subsequent sequence insertion process.

To summarize: For each reference genome group:

The DNA sequence that I used for the insertion was not the corresponding nucleic acid sequence of the protein, but another conserved gene from the same genome, which I expected to have the same phylogenetic characteristics. You mentioned that using incorrect models in EPA-ng might cause the errors that I encountered, but the tree itself was built based on protein sequences. Should I set the model to be consistent with the actual tree or should I change the MSA type input during the insertion process? Maybe these operations sound ‘illegal’, but the actual requirement that we have in our research is to insert DNA sequences into phylogenetic trees built from proteins. Would this problem be avoided if I used the corresponding nucleic acid sequence instead of the amino acid sequence to build the initial tree?

As for the installation method: on both Linux and MacOS, EPA-ng was installed with the conda install -c bioconda epa-ng command, so I did not compile it myself.

Thank you for your time and assistance.

Sincerely, Tomoko

lczech commented 1 year ago

Hi @kawabata-tomoko,

thanks, that makes more sense now :-)

Should I set the model to be consistent with the actual tree or should I change the MSA type input during the insertion process? Maybe these operations sound ‘illegal’, but the actual requirement that we have in our research is to insert DNA sequences into phylogenetic trees built from proteins. Would this problem be avoided if I used the corresponding nucleic acid sequence instead of the amino acid sequence to build the initial tree?

It is totally fine to use a tree built from protein sequences, if that more accurately reflects the phylogenetic relationship of your data (there is always that issue with gene tree vs species tree...). However, you will still need to use model parameters that are optimized for the type of MSA and query sequence data that you are going to place. See here for how to do that.

In short, in your case, you want to get the model parameters for a DNA model (of your choice, but GTR+G is usually a good starting point) optimized for your tree and placement MSA. The fact that the tree was originally inferred from protein does not matter at this point. But you cannot use the protein model to place DNA sequences.
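As a rough sketch of that two-step workflow, the snippet below only assembles the two command lines and does not run anything. It assumes raxml-ng is used for the parameter optimization (its --evaluate mode with --msa, --tree, --prefix and --model) and that it writes the optimized model to a file named after the prefix with the suffix .raxml.bestModel; please verify both against the current raxml-ng and EPA-ng documentation. The epa-ng flags are the ones used earlier in this thread. The helper name placement_commands and the prefix info are made up for illustration.

```python
import shlex


def placement_commands(tree, ref_msa, query,
                       model="GTR+G", prefix="info", outdir="."):
    """Build (evaluate_cmd, place_cmd) as argument lists.

    Step 1 optimizes the DNA model parameters on the fixed tree and the
    DNA reference MSA. Step 2 passes the resulting model file to epa-ng
    instead of a bare model name.
    """
    evaluate = [
        "raxml-ng", "--evaluate",
        "--msa", ref_msa,
        "--tree", tree,
        "--prefix", prefix,
        "--model", model,
    ]
    place = [
        "epa-ng",
        "--tree", tree,
        "--msa", ref_msa,
        "--query", query,
        # Assumed output name of the raxml-ng evaluation step above.
        "--model", f"{prefix}.raxml.bestModel",
        "--outdir", outdir,
    ]
    return evaluate, place


if __name__ == "__main__":
    # File names taken from the command in the original report.
    for cmd in placement_commands("refer_tree.tree",
                                  "reference_v4.mafft",
                                  "query_aligned.fasta"):
        print(shlex.join(cmd))
```

Since the log warns that the tree and the reference MSA differ in their number of taxa (196 vs. 186), it is probably worth reconciling the two before the evaluation step.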

Let me know if that helps and solves the issue :-)

Cheers and so long Lucas