Closed phiweger closed 9 months ago
I am having a hard time reproducing this :/ Can you please send the FASTA file you used to get this error? :)
I just sent it.
Still unable to reproduce:
$ anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics-Viehwege.fa -o phylogenomic-tree.txt
Input aligment file path .....................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/seqs-for-phylogenomics-Viehwege.fa
Output file path .............................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: Streptococcus, P_rhinitidis, L_citreum, C_albicans, S_epidermidis, F_magna, P_avidum, E_facealis, S_hominis, Aneorococcus_sp, S_aureus
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 No SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
FastTree output newick file ..................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
$ cat phylogenomic-tree.txt
((Streptococcus:0.27715,E_facealis:0.10530)0.791:0.03751,(P_avidum:0.60808,L_citreum:0.16850)1.000:0.08870,((S_aureus:0.03558,(S_epidermidis:0.04069,S_hominis:0.04839)0.808:0.01802)1.000:0.08470,(C_albicans:0.43611,(Aneorococcus_sp:0.41781,(F_magna:0.18320,P_rhinitidis:0.17801)0.993:0.04472)0.999:0.05406)1.000:0.08265)1.000:0.06239);
Odd. When I run muscle followed by FastTree manually and proceed w/ the following step in the tutorial
anvi-interactive --tree phylogenomic-tree.txt \
-p temp-profile.db \
--title "Pylogenomics of IGD Bins" \
--manual
then all's well. I reinstalled ete3 with conda install ete3==3.1.1, as it says in anvio's requirements.txt; still, the error persists.
One observation is that the ete3 error is thrown very shortly after calling anvi-gen-phylogenomic-tree, so muscle cannot have finished yet. So I guess there really might not be an MSA yet. Could the call to muscle be the problem?
Can you please run the same command with the flag --debug, so we can see the traceback?
Sure:
anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics.fa -o phylogenomic-tree.txt --debug
Input aligment file path .....................: .../gone-fishing/INFANT-GUT-TUTORIAL/seqs-for-phylogenomics.fa
Output file path .............................: .../gone-fishing/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: P_avidum, F_magna, L_citreum, S_aureus, Aneorococcus_sp, Streptococcus, S_epidermidis, C_albicans, P_rhinitidis, S_hominis, E_facealis
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Traceback for debugging
================================================================================
File "/usr/local/bin/anvi-gen-phylogenomic-tree", line 70, in <module>
main(args)
File "/usr/local/bin/anvi-gen-phylogenomic-tree", line 52, in main
program().run_command(input_file_path, output_file_path)
File "/usr/local/Cellar/anvio/5.1/libexec/lib/python3.7/site-packages/anvio/drivers/fasttree.py", line 63, in run_command
if filesnpaths.is_proper_newick(output_stdout):
File "/usr/local/Cellar/anvio/5.1/libexec/lib/python3.7/site-packages/anvio/filesnpaths.py", line 57, in is_proper_newick
to say about this: '%s'. Pity :/" % e)
================================================================================
File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
about this: 'Unexisting tree file or Malformed newick tree structure. You may
want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
Pity :/
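The ETE message above is what you get when the text handed to the parser is not a valid Newick tree, which would be consistent with FastTree having produced no tree at all. A rough pre-check of a candidate tree string can be sketched as follows. It assumes only that a Newick string is non-empty, ends with a semicolon, and has balanced parentheses; `looks_like_newick` is a hypothetical helper and is not the check that anvi'o or ETE performs.

```python
def looks_like_newick(text):
    """Cheap sanity check for a Newick string (hypothetical helper).

    Only tests three things: the text is non-empty, it ends with ';',
    and its parentheses are balanced. Quoted labels that contain
    parentheses are not handled, so this is not a full parser.
    """
    text = text.strip()
    if not text or not text.endswith(";"):
        return False
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


if __name__ == "__main__":
    print(looks_like_newick("((A:0.1,B:0.2)0.9:0.05,C:0.3);"))
    print(looks_like_newick(""))  # what an empty FastTree output would look like
```

If a check like this fails on the FastTree output, the problem most likely lies upstream of ETE.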
I was just going through old issues that were not fully addressed and saw this one. I hope it sorted itself out :( Thanks for taking the time to report this and for your follow-up to help identify the problem, and apologies for not getting back to this earlier.
Hi @meren, I am getting the same error as described in this issue.
os: MacOS Catalina 10.15.4
anvio version
Anvi'o version ...............................: esther (v6.2-master)
Profile DB version ...........................: 32
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1
This was my original command, and the concatenated-proteins.fa contained all SCGs I found when visualizing the pangenome.
$ anvi-gen-phylogenomic-tree -f concatenated-proteins.fa -o phylogenomic-tree.txt --debug
Input aligment file path .....................: /Users/mschechter/Downloads/concatenated-proteins.fa
Output file path .............................: /Users/mschechter/Downloads/phylogenomic-tree.txt
Alignment names ..............................: genome_1, genome_10, genome_100, genome_101, genome_102, genome_103, genome_104, genome_105, genome_106, genome_107, genome_108, genome_109, genome_11, genome_110, genome_112, genome_113, genome_114, genome_115, genome_116, genome_117, genome_118, genome_119, genome_12, genome_120, genome_121, genome_122, genome_123, genome_124, genome_125, genome_126, genome_128, genome_129, genome_13, genome_130, genome_131, genome_132, genome_133, genome_134, genome_135, genome_136, genome_137, genome_139, genome_14, genome_140, genome_141, genome_143, genome_144, genome_145, genome_146, genome_147, genome_148, genome_15, genome_150, genome_151, genome_153, genome_154, genome_155, genome_157, genome_158, genome_159, genome_16, genome_161, genome_162, genome_163, genome_164, genome_165, genome_166, genome_167, genome_168, genome_169, genome_17, genome_170, genome_171, genome_172, genome_174, genome_175, genome_176, genome_177, genome_178, genome_18, genome_180, genome_183, genome_184, genome_185, genome_186, genome_187, genome_188, genome_189, genome_19, genome_190, genome_191, genome_192, genome_193, genome_194, genome_196, genome_198, genome_199, genome_2, genome_20, genome_200, genome_201, genome_202, genome_203, genome_204, genome_205, genome_206, genome_207, genome_208, genome_209, genome_21, genome_210, genome_211, genome_212, genome_213, genome_214, genome_215, genome_216, genome_217, genome_218, genome_219, genome_22, genome_220, genome_221, genome_222, genome_223, genome_225, genome_226, genome_227, genome_228, genome_229, genome_23, genome_230, genome_231, genome_232, genome_233, genome_234, genome_235, genome_236, genome_238, genome_239, genome_24, genome_240, genome_241, genome_242, genome_243, genome_244, genome_245, genome_246, genome_247, genome_248, genome_249, genome_25, genome_250, genome_251, genome_252, genome_253, genome_254, genome_255, genome_256, genome_257, genome_258, genome_259, genome_260, genome_261, 
genome_262, genome_263, genome_264, genome_265, genome_266, genome_267, genome_268, genome_269, genome_270, genome_271, genome_273, genome_274, genome_275, genome_276, genome_277, genome_278, genome_279, genome_28, genome_280, genome_281, genome_282, genome_283, genome_285, genome_286, genome_289, genome_29, genome_290, genome_291, genome_292, genome_293, genome_294, genome_295, genome_296, genome_297, genome_298, genome_299, genome_3, genome_30, genome_300, genome_301, genome_303, genome_304, genome_305, genome_306, genome_307, genome_308, genome_309, genome_31, genome_310, genome_311, genome_312, genome_313, genome_314, genome_315, genome_316, genome_317, genome_318, genome_319, genome_32, genome_320, genome_321, genome_323, genome_324, genome_325, genome_326, genome_327, genome_328, genome_329, genome_33, genome_330, genome_331, genome_34, genome_35, genome_36, genome_37, genome_38, genome_39, genome_4, genome_40, genome_41, genome_42, genome_44, genome_45, genome_46, genome_47, genome_48, genome_49, genome_5, genome_50, genome_51, genome_52, genome_53, genome_54, genome_56, genome_57, genome_58, genome_59, genome_60, genome_62, genome_63, genome_64, genome_65, genome_66, genome_67, genome_68, genome_69, genome_7, genome_70, genome_71, genome_72, genome_73, genome_74, genome_75, genome_76, genome_77, genome_78, genome_79, genome_8, genome_80, genome_81, genome_82, genome_83, genome_84, genome_86, genome_87, genome_88, genome_89, genome_9, genome_90, genome_91, genome_92, genome_93, genome_94, genome_95, genome_96, genome_97, genome_98, genome_99, newman_127, usa300_111, usa300_138, usa300_149, usa300_152, usa300_156, usa300_160, usa300_173, usa300_179, usa300_182, usa300_195, usa300_197, usa300_237, usa300_26, usa300_27, usa300_272, usa300_284, usa300_287, usa300_288, usa300_302, usa300_322, usa300_43, usa300_55, usa300_6, usa300_61, usa300_85
Alignment sequence length ....................: 179,052
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for genome_1 ......: expected 182504 but have 182502 instead.
Info .........................................: This sequence may be truncated, or another sequence may be too long.
Traceback for debugging
================================================================================
File "/Users/mschechter/github/anvio/bin/anvi-gen-phylogenomic-tree", line 73, in <module>
main(args)
File "/Users/mschechter/github/anvio/bin/anvi-gen-phylogenomic-tree", line 55, in main
program().run_command(input_file_path, output_file_path)
File "/Users/mschechter/github/anvio/anvio/drivers/fasttree.py", line 63, in run_command
if filesnpaths.is_proper_newick(output_stdout):
File "/Users/mschechter/github/anvio/anvio/filesnpaths.py", line 57, in is_proper_newick
"to say about this: '%s'. Pity :/" % e)
================================================================================
File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
about this: 'Unexisting tree file or Malformed newick tree structure. You may
want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
Pity :/
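The "Wrong number of characters" line in the log above suggests that FastTree did not receive sequences of equal length, so one thing worth ruling out is whether the FASTA file itself has uneven entries. A minimal sketch for that, assuming a plain (possibly multi-line) FASTA file; `read_fasta` and `length_report` are hypothetical names and not part of anvi'o:

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].strip()
            # Use the first whitespace-delimited token as the name.
            name = header.split()[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_report(handle):
    """Group sequence names by length so off-length entries stand out."""
    report = {}
    for name, seq in read_fasta(handle):
        report.setdefault(len(seq), []).append(name)
    return report


if __name__ == "__main__":
    # Tiny in-memory example; for a real file use open("your-file.fa").
    demo = io.StringIO(">genome_1\nMKT-A\nLL\n>genome_2\nMKTQA\nLLV\n")
    for length, names in sorted(length_report(demo).items()):
        print(length, names)
```

A proper alignment should produce a single length in the report; more than one key points to the entries that differ.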
I then went back into the interactive interface and made a new, significantly smaller selection of SCGs (n = 5) and anvi-gen-phylogenomic-tree worked.
$ anvi-gen-phylogenomic-tree -f concatenated-proteins_small.fa -o phylogenomic-tree.txt
Input aligment file path .....................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/concatenated-proteins_small.fa
Output file path .............................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/phylogenomic-tree.txt
Alignment names ..............................: genome_1, genome_10, genome_100, genome_101, genome_102, genome_103, genome_104, genome_105, genome_106, genome_107, genome_108, genome_109, genome_11, genome_110, genome_112, genome_113, genome_114, genome_115, genome_116, genome_117, genome_118, genome_119, genome_12, genome_120, genome_121, genome_122, genome_123, genome_124, genome_125, genome_126, genome_128, genome_129, genome_13, genome_130, genome_131, genome_132, genome_133, genome_134, genome_135, genome_136, genome_137, genome_139, genome_14, genome_140, genome_141, genome_143, genome_144, genome_145, genome_146, genome_147, genome_148, genome_15, genome_150, genome_151, genome_153, genome_154, genome_155, genome_157, genome_158, genome_159, genome_16, genome_161, genome_162, genome_163, genome_164, genome_165, genome_166, genome_167, genome_168, genome_169, genome_17, genome_170, genome_171, genome_172, genome_174, genome_175, genome_176, genome_177, genome_178, genome_18, genome_180, genome_183, genome_184, genome_185, genome_186, genome_187, genome_188, genome_189, genome_19, genome_190, genome_191, genome_192, genome_193, genome_194, genome_196, genome_198, genome_199, genome_2, genome_20, genome_200, genome_201, genome_202, genome_203, genome_204, genome_205, genome_206, genome_207, genome_208, genome_209, genome_21, genome_210, genome_211, genome_212, genome_213, genome_214, genome_215, genome_216, genome_217, genome_218, genome_219, genome_22, genome_220, genome_221, genome_222, genome_223, genome_225, genome_226, genome_227, genome_228, genome_229, genome_23, genome_230, genome_231, genome_232, genome_233, genome_234, genome_235, genome_236, genome_238, genome_239, genome_24, genome_240, genome_241, genome_242, genome_243, genome_244, genome_245, genome_246, genome_247, genome_248, genome_249, genome_25, genome_250, genome_251, genome_252, genome_253, genome_254, genome_255, genome_256, genome_257, genome_258, genome_259, genome_260, genome_261, 
genome_262, genome_263, genome_264, genome_265, genome_266, genome_267, genome_268, genome_269, genome_270, genome_271, genome_273, genome_274, genome_275, genome_276, genome_277, genome_278, genome_279, genome_28, genome_280, genome_281, genome_282, genome_283, genome_285, genome_286, genome_289, genome_29, genome_290, genome_291, genome_292, genome_293, genome_294, genome_295, genome_296, genome_297, genome_298, genome_299, genome_3, genome_30, genome_300, genome_301, genome_303, genome_304, genome_305, genome_306, genome_307, genome_308, genome_309, genome_31, genome_310, genome_311, genome_312, genome_313, genome_314, genome_315, genome_316, genome_317, genome_318, genome_319, genome_32, genome_320, genome_321, genome_323, genome_324, genome_325, genome_326, genome_327, genome_328, genome_329, genome_33, genome_330, genome_331, genome_34, genome_35, genome_36, genome_37, genome_38, genome_39, genome_4, genome_40, genome_41, genome_42, genome_44, genome_45, genome_46, genome_47, genome_48, genome_49, genome_5, genome_50, genome_51, genome_52, genome_53, genome_54, genome_56, genome_57, genome_58, genome_59, genome_60, genome_62, genome_63, genome_64, genome_65, genome_66, genome_67, genome_68, genome_69, genome_7, genome_70, genome_71, genome_72, genome_73, genome_74, genome_75, genome_76, genome_77, genome_78, genome_79, genome_8, genome_80, genome_81, genome_82, genome_83, genome_84, genome_86, genome_87, genome_88, genome_89, genome_9, genome_90, genome_91, genome_92, genome_93, genome_94, genome_95, genome_96, genome_97, genome_98, genome_99, newman_127, usa300_111, usa300_138, usa300_149, usa300_152, usa300_156, usa300_160, usa300_173, usa300_179, usa300_182, usa300_195, usa300_197, usa300_237, usa300_26, usa300_27, usa300_272, usa300_284, usa300_287, usa300_288, usa300_302, usa300_322, usa300_43, usa300_55, usa300_6, usa300_61, usa300_85
Alignment sequence length ....................: 975
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Info .........................................: Ignored unknown character X (seen 576 times)
Refining topology ............................: 22 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
Info .........................................: Total branch-length 0.078 after 0.18 sec
ML-NNI round 1 ...............................: LogLk = -3545.418 NNIs 15 max delta 5.85 Time 1.76
Info .........................................: Switched to using 20 rate categories (CAT approximation)
Info .........................................: Rate categories were divided by 0.645 so that average rate = 1.0
Info .........................................: CAT-based log-likelihoods may not be comparable across runs
Info .........................................: Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2 ...............................: LogLk = -3512.777 NNIs 7 max delta 0.00 Time 3.66
Info .........................................: Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3 ...............................: LogLk = -3512.777 NNIs 5 max delta 0.00 Time 4.85 (final)
Optimize all lengths .........................: LogLk = -3512.777 Time 5.20
FastTree output newick file ..................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/phylogenomic-tree.txt
Here are the differences in alignment lengths between the input files:
concatenated-proteins.fa: 179,052
concatenated-proteins_small.fa: 975
I also attempted to use MUSCLE and FastTree individually with my original concatenated-proteins.fa but unfortunately could not get past the alignment step. I am not sure whether this is useful, but I wanted to include it just in case.
$ muscle -in ../concatenated-proteins.fa -out concatenated-proteins.msa
MUSCLE v3.8.1551 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
concatenated-proteins 328 seqs, lengths min 178383, max 178587, avg 178512
00:01:04 326 MB(17%) Iter 1 100.00% K-mer dist pass 1
00:01:04 326 MB(17%) Iter 1 100.00% K-mer dist pass 2
Killed04 436 MB(22%) Iter 1 0.31% Align node
Thank you for taking a look and please let me know if you want me to send you any of my files for reproducibility.
Alignment sequence length ....................: 179,052
This is simply too many residues to consider. That's why Mahmoud has implemented functional homogeneity estimates per gene cluster, so you can choose only those gene clusters with meaningful variation (most of them will have functional homogeneity of 1.0, meaning that there is no variation across genes within them) and no alignment issues (i.e., geometric homogeneity > 0.95).
Thanks for the suggestions @meren. I went back and filtered for a group of 70 SCGs using the combined homogeneity index and was able to run anvi-gen-phylogenomic-tree successfully.
Hi guys, I'm using anvi'o v7 within the Docker container. I'm trying to build a phylogenetic tree from the concatenated single-copy core gene (SCG) sequences extracted with:
anvi-get-sequences-for-gene-clusters -g Xac-GENOMES.db -p XacPangenome/XacAnalysis-PAN.db --min-functional-homogeneity-index 1 --min-geometric-homogeneity-index 0.95 --min-num-genomes-gene-cluster-occurs 13 --max-num-genes-from-each-genome 1 --concatenate-gene-clusters -o SCG-H-filtered.fasta
But when I run:
anvi-gen-phylogenomic-tree -f SCG-H-filtered.fasta -o SCG-H-tree
I am still getting the famous error from ETE:
Input aligment file path .....................: /home/silviat/Andrea/Xac_pangenome/SCG-H-filtered.fasta
Output file path .............................: /home/silviat/Andrea/Xac_pangenome/SCG-H-tree
Alignment names ..............................: Xac_301, Xac_A7, Xac_CFBP1159_INRA, Xac_CFBP1159_ZHAW, Xac_CFBP1846, Xac_CFBP2565, Xac_CFBP6600, Xac_IVIA3978, Xac_NCCB100457, Xac_XH2, Xac_XH3, Xac_XH7, Xac_XH8
Alignment sequence length ....................: 509,496
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for Xac_301 .......: expected 526407 but have 526397 instead.
Info .........................................: This sequence may be truncated, or another sequence may be too long.
File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
about this: 'Unexisting tree file or Malformed newick tree structure. You may
want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
Pity :/
Of course, what it says about Xac_301 (expected 526407 but have 526397 instead) is not really true. Is the problem the large number of SCGs? I cannot go lower with the number of gene clusters since we are dealing with isolates of the same pathovar.
Any suggestions? Is there an alternative way to build a phylogenetic tree from the SCGs? Thanks, Silvia
Hey @Sirbius, would you please consider using the Docker container for v8?
Plus,
Alignment sequence length ....................: 509,496
0.5 million residues is a little too much for any meaningful analysis, I think :) You should consider using these flags instead:
--min-geometric-homogeneity-index 1.0 --min-functional-homogeneity-index 0.95
Yours seem to be the opposite of the best practice.
Hello @meren, I’m working with @Sirbius. Could you please explain why you think that a phylogenomic tree built from such a long alignment is meaningless? We thought that the more genes we compare among strains, the more solid the analysis is. Are we wrong? We tried the flags with the parameters you suggested, but we still obtain more than 3000 clusters, which is too many for Anvi’o to build a phylogenomic tree. Since genes that have 100% sequence identity across all the strains analyzed are not informative in a phylogenomic study, would it make sense to set --max-geometric-homogeneity-index 0.99 in order to exclude all the clusters whose genes have the same sequence in all the strains? Is that the purpose of the geometric homogeneity index? With this parameter we obtain only 265 clusters and we are able to build a phylogenetic tree. Thank you in advance,
Andrea
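The point about identical sequences can be made concrete at the level of alignment columns: a column in which every strain carries the same residue cannot separate the strains from one another. Below is a minimal sketch of counting the columns that do vary, assuming an already-aligned set of equal-length sequences held in a dict. `variable_columns` is a hypothetical helper, and this is not how anvi'o computes its homogeneity indices.

```python
def variable_columns(alignment, ignore="-X"):
    """Return indices of alignment columns with more than one residue.

    `alignment` maps sequence names to aligned sequences of equal length.
    Characters listed in `ignore` (gaps and unknown residues by default)
    are not counted as variation.
    """
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    skipped = set(ignore)
    columns = []
    for i in range(length):
        residues = {s[i] for s in seqs} - skipped
        if len(residues) > 1:
            columns.append(i)
    return columns


if __name__ == "__main__":
    # Made-up toy alignment, for illustration only.
    demo = {"strain_1": "MKTAL", "strain_2": "MKSAL", "strain_3": "MK-AL"}
    print(variable_columns(demo))
```

In the toy alignment only the third column differs between strains, so a long alignment can carry very little signal if most of its columns are invariant.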
Hi @MrCorylus, would you mind sharing with me a private download link for the PAN.db and genomes storage (GENOMES.db) via email so I can take look at the data before making a suggestion?
Hi again Andrea,
Thanks for the email.
We tried to use the flags with the parameters you suggested, but we still obtain more than 3000 clusters, which is too much for Anvi’o to build a phylogenomic tree.
I'm sorry, I can see that I've made a mistake in my suggestion. It should've been --max-functional-homogeneity-index 0.95 and not --min-functional-homogeneity-index 0.95. But when you correct for that you only get 2 gene clusters, which is not very useful. But using these parameters instead,
I was able to get 30 gene clusters, and was able to generate a tree:
I used the interactive interface for convenience, but you should be able to translate my parameters to the command line easily.
I hope this helps.
Best wishes, Meren
Hi,
I am having problems similar to issue #690 related to building a phylogenetic tree.
Housekeeping first:
I installed anvio via brew on a Mac (High Sierra 10.13.5).
I am following Murat's tutorial on the infant gut dataset:
Thank you for looking into this.