merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

phylogenetic tree build fails #931

Closed phiweger closed 9 months ago

phiweger commented 6 years ago

Hi,

I am having problems similar to issue #690 related to building a phylogenetic tree.

Housekeeping first:

Anvi'o version ...............................: margaret (v5.1)
Profile DB version ...........................: 29
Contigs DB version ...........................: 12
Pan DB version ...............................: 12
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

I installed anvi'o via brew on macOS High Sierra 10.13.5:

brew tap merenlab/anvio
brew install merenlab/anvio/anvio
anvi-self-test --suite mini
# all fine

I am following Murat's tutorial on the infant gut dataset:

anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics.fa -o phylogenomic-tree.txt

Input aligment file path .....................: .../INFANT-GUT-TUTORIAL/seqs-for-phylogenomics.fa
Output file path .............................: .../INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: Streptococcus, P_rhinitidis, L_citreum, C_albicans, S_epidermidis, F_magna, P_avidum, E_facealis, S_hominis, Aneorococcus_sp, S_aureus
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories

File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
                 about this: 'Unexisting tree file or Malformed newick tree structure. You may
                 want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
                 Pity :/

Thank you for looking into this.

meren commented 6 years ago

I am having a hard time reproducing this :/ Can you please send the FASTA file you used to get this error? :)

phiweger commented 6 years ago

I just sent it.

meren commented 6 years ago

Still unable to reproduce:

$ anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics-Viehwege.fa -o phylogenomic-tree.txt
Input aligment file path .....................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/seqs-for-phylogenomics-Viehwege.fa
Output file path .............................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: Streptococcus, P_rhinitidis, L_citreum, C_albicans, S_epidermidis, F_magna, P_avidum, E_facealis, S_hominis, Aneorococcus_sp, S_aureus
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 No SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories

FastTree output newick file ..................: /Users/meren/workshop-jena/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt

$ cat phylogenomic-tree.txt
((Streptococcus:0.27715,E_facealis:0.10530)0.791:0.03751,(P_avidum:0.60808,L_citreum:0.16850)1.000:0.08870,((S_aureus:0.03558,(S_epidermidis:0.04069,S_hominis:0.04839)0.808:0.01802)1.000:0.08470,(C_albicans:0.43611,(Aneorococcus_sp:0.41781,(F_magna:0.18320,P_rhinitidis:0.17801)0.993:0.04472)0.999:0.05406)1.000:0.08265)1.000:0.06239);

phiweger commented 6 years ago

Odd. When I run muscle followed by FastTree manually and proceed with the following step in the tutorial

anvi-interactive --tree phylogenomic-tree.txt \
                 -p temp-profile.db \
                 --title "Pylogenomics of IGD Bins" \
                 --manual

then all is well. I reinstalled ete3 (conda install ete3==3.1.1) as specified in anvi'o's requirements.txt, but the error persists.

One observation: the ete3 error is thrown very shortly after calling anvi-gen-phylogenomic-tree, so muscle cannot have finished by then. So I guess there might not be an MSA at that point. Could the call to muscle be the problem?

meren commented 6 years ago

Can you please run the same command with the --debug flag so we can see the traceback?

phiweger commented 6 years ago

Sure:

anvi-gen-phylogenomic-tree -f seqs-for-phylogenomics.fa -o phylogenomic-tree.txt --debug

Input aligment file path .....................: .../gone-fishing/INFANT-GUT-TUTORIAL/seqs-for-phylogenomics.fa
Output file path .............................: .../gone-fishing/INFANT-GUT-TUTORIAL/phylogenomic-tree.txt
Alignment names ..............................: P_avidum, F_magna, L_citreum, S_aureus, Aneorococcus_sp, Streptococcus, S_epidermidis, C_albicans, P_rhinitidis, S_hominis, E_facealis
Alignment sequence length ....................: 8,816
Version ......................................: FastTree Version 2.1.10 SSE3
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories

Traceback for debugging
================================================================================
  File "/usr/local/bin/anvi-gen-phylogenomic-tree", line 70, in <module>
    main(args)
  File "/usr/local/bin/anvi-gen-phylogenomic-tree", line 52, in main
    program().run_command(input_file_path, output_file_path)
  File "/usr/local/Cellar/anvio/5.1/libexec/lib/python3.7/site-packages/anvio/drivers/fasttree.py", line 63, in run_command
    if filesnpaths.is_proper_newick(output_stdout):
  File "/usr/local/Cellar/anvio/5.1/libexec/lib/python3.7/site-packages/anvio/filesnpaths.py", line 57, in is_proper_newick
    to say about this: '%s'. Pity :/" % e)
================================================================================

File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
                 about this: 'Unexisting tree file or Malformed newick tree structure. You may
                 want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
                 Pity :/
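
The traceback shows the newick check being applied to FastTree's standard output (`output_stdout`), so an empty or truncated output would plausibly produce this same message regardless of what went wrong upstream. Below is a rough, stdlib-only sketch of that idea. `looks_like_newick` is a hypothetical helper written for illustration; it is not anvi'o's `is_proper_newick` and not ETE's parser, and it ignores quoted node names.

```python
def looks_like_newick(text):
    """Very rough structural check on a newick string.

    Hypothetical helper for illustration only: a real parser such as
    ETE validates far more than balanced parentheses.
    """
    text = text.strip()
    # An empty string (e.g. a tree builder that produced no output)
    # or a string without the terminating ';' cannot be a full tree.
    if not text or not text.endswith(";"):
        return False
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis with no matching opener.
                return False
    # Every opened group must have been closed.
    return depth == 0


empty_output_ok = looks_like_newick("")
small_tree_ok = looks_like_newick("((A:0.1,B:0.2)0.9:0.05,C:0.3);")
```

With this sketch, an empty output fails the check while a small complete tree passes, which is consistent with the error appearing almost immediately after the command starts.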

meren commented 5 years ago

I was just going through old issues that were not fully addressed and saw this one. I hope it sorted itself out :( Thanks for taking the time to report this and for your follow-up to help identify the problem, and apologies for not getting back to this earlier.

mschecht commented 4 years ago

Hi @meren, I am getting the same error as described in this issue.

OS: macOS Catalina 10.15.4

anvi'o version:

Anvi'o version ...............................: esther (v6.2-master)
Profile DB version ...........................: 32
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

This was my original command; concatenated-proteins.fa contained all the SCGs I found when visualizing the pangenome.

$ anvi-gen-phylogenomic-tree -f concatenated-proteins.fa -o phylogenomic-tree.txt --debug

Input aligment file path .....................: /Users/mschechter/Downloads/concatenated-proteins.fa
Output file path .............................: /Users/mschechter/Downloads/phylogenomic-tree.txt
Alignment names ..............................: genome_1, genome_10, genome_100, genome_101, genome_102, genome_103, genome_104, genome_105, genome_106, genome_107, genome_108, genome_109, genome_11, genome_110, genome_112, genome_113, genome_114, genome_115, genome_116, genome_117, genome_118, genome_119, genome_12, genome_120, genome_121, genome_122, genome_123, genome_124, genome_125, genome_126, genome_128, genome_129, genome_13, genome_130, genome_131, genome_132, genome_133, genome_134, genome_135, genome_136, genome_137, genome_139, genome_14, genome_140, genome_141, genome_143, genome_144, genome_145, genome_146, genome_147, genome_148, genome_15, genome_150, genome_151, genome_153, genome_154, genome_155, genome_157, genome_158, genome_159, genome_16, genome_161, genome_162, genome_163, genome_164, genome_165, genome_166, genome_167, genome_168, genome_169, genome_17, genome_170, genome_171, genome_172, genome_174, genome_175, genome_176, genome_177, genome_178, genome_18, genome_180, genome_183, genome_184, genome_185, genome_186, genome_187, genome_188, genome_189, genome_19, genome_190, genome_191, genome_192, genome_193, genome_194, genome_196, genome_198, genome_199, genome_2, genome_20, genome_200, genome_201, genome_202, genome_203, genome_204, genome_205, genome_206, genome_207, genome_208, genome_209, genome_21, genome_210, genome_211, genome_212, genome_213, genome_214, genome_215, genome_216, genome_217, genome_218, genome_219, genome_22, genome_220, genome_221, genome_222, genome_223, genome_225, genome_226, genome_227, genome_228, genome_229, genome_23, genome_230, genome_231, genome_232, genome_233, genome_234, genome_235, genome_236, genome_238, genome_239, genome_24, genome_240, genome_241, genome_242, genome_243, genome_244, genome_245, genome_246, genome_247, genome_248, genome_249, genome_25, genome_250, genome_251, genome_252, genome_253, genome_254, genome_255, genome_256, genome_257, genome_258, genome_259, genome_260, genome_261, 
genome_262, genome_263, genome_264, genome_265, genome_266, genome_267, genome_268, genome_269, genome_270, genome_271, genome_273, genome_274, genome_275, genome_276, genome_277, genome_278, genome_279, genome_28, genome_280, genome_281, genome_282, genome_283, genome_285, genome_286, genome_289, genome_29, genome_290, genome_291, genome_292, genome_293, genome_294, genome_295, genome_296, genome_297, genome_298, genome_299, genome_3, genome_30, genome_300, genome_301, genome_303, genome_304, genome_305, genome_306, genome_307, genome_308, genome_309, genome_31, genome_310, genome_311, genome_312, genome_313, genome_314, genome_315, genome_316, genome_317, genome_318, genome_319, genome_32, genome_320, genome_321, genome_323, genome_324, genome_325, genome_326, genome_327, genome_328, genome_329, genome_33, genome_330, genome_331, genome_34, genome_35, genome_36, genome_37, genome_38, genome_39, genome_4, genome_40, genome_41, genome_42, genome_44, genome_45, genome_46, genome_47, genome_48, genome_49, genome_5, genome_50, genome_51, genome_52, genome_53, genome_54, genome_56, genome_57, genome_58, genome_59, genome_60, genome_62, genome_63, genome_64, genome_65, genome_66, genome_67, genome_68, genome_69, genome_7, genome_70, genome_71, genome_72, genome_73, genome_74, genome_75, genome_76, genome_77, genome_78, genome_79, genome_8, genome_80, genome_81, genome_82, genome_83, genome_84, genome_86, genome_87, genome_88, genome_89, genome_9, genome_90, genome_91, genome_92, genome_93, genome_94, genome_95, genome_96, genome_97, genome_98, genome_99, newman_127, usa300_111, usa300_138, usa300_149, usa300_152, usa300_156, usa300_160, usa300_173, usa300_179, usa300_182, usa300_195, usa300_197, usa300_237, usa300_26, usa300_27, usa300_272, usa300_284, usa300_287, usa300_288, usa300_302, usa300_322, usa300_43, usa300_55, usa300_6, usa300_61, usa300_85
Alignment sequence length ....................: 179,052
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for genome_1 ......: expected 182504 but have 182502 instead.
Info .........................................: This sequence may be truncated, or another sequence may be too long.

Traceback for debugging
================================================================================
  File "/Users/mschechter/github/anvio/bin/anvi-gen-phylogenomic-tree", line 73, in <module>
    main(args)
  File "/Users/mschechter/github/anvio/bin/anvi-gen-phylogenomic-tree", line 55, in main
    program().run_command(input_file_path, output_file_path)
  File "/Users/mschechter/github/anvio/anvio/drivers/fasttree.py", line 63, in run_command
    if filesnpaths.is_proper_newick(output_stdout):
  File "/Users/mschechter/github/anvio/anvio/filesnpaths.py", line 57, in is_proper_newick
    "to say about this: '%s'. Pity :/" % e)
================================================================================

File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
                 about this: 'Unexisting tree file or Malformed newick tree structure. You may
                 want to check other newick loading flags like 'format' or 'quoted_node_names'.'.
                 Pity :/

I then went back into the interactive interface and made a new, significantly smaller selection of SCGs (n = 5) and anvi-gen-phylogenomic-tree worked.

$ anvi-gen-phylogenomic-tree -f concatenated-proteins_small.fa -o phylogenomic-tree.txt

Input aligment file path .....................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/concatenated-proteins_small.fa
Output file path .............................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/phylogenomic-tree.txt
Alignment names ..............................: genome_1, genome_10, genome_100, genome_101, genome_102, genome_103, genome_104, genome_105, genome_106, genome_107, genome_108, genome_109, genome_11, genome_110, genome_112, genome_113, genome_114, genome_115, genome_116, genome_117, genome_118, genome_119, genome_12, genome_120, genome_121, genome_122, genome_123, genome_124, genome_125, genome_126, genome_128, genome_129, genome_13, genome_130, genome_131, genome_132, genome_133, genome_134, genome_135, genome_136, genome_137, genome_139, genome_14, genome_140, genome_141, genome_143, genome_144, genome_145, genome_146, genome_147, genome_148, genome_15, genome_150, genome_151, genome_153, genome_154, genome_155, genome_157, genome_158, genome_159, genome_16, genome_161, genome_162, genome_163, genome_164, genome_165, genome_166, genome_167, genome_168, genome_169, genome_17, genome_170, genome_171, genome_172, genome_174, genome_175, genome_176, genome_177, genome_178, genome_18, genome_180, genome_183, genome_184, genome_185, genome_186, genome_187, genome_188, genome_189, genome_19, genome_190, genome_191, genome_192, genome_193, genome_194, genome_196, genome_198, genome_199, genome_2, genome_20, genome_200, genome_201, genome_202, genome_203, genome_204, genome_205, genome_206, genome_207, genome_208, genome_209, genome_21, genome_210, genome_211, genome_212, genome_213, genome_214, genome_215, genome_216, genome_217, genome_218, genome_219, genome_22, genome_220, genome_221, genome_222, genome_223, genome_225, genome_226, genome_227, genome_228, genome_229, genome_23, genome_230, genome_231, genome_232, genome_233, genome_234, genome_235, genome_236, genome_238, genome_239, genome_24, genome_240, genome_241, genome_242, genome_243, genome_244, genome_245, genome_246, genome_247, genome_248, genome_249, genome_25, genome_250, genome_251, genome_252, genome_253, genome_254, genome_255, genome_256, genome_257, genome_258, genome_259, genome_260, genome_261, 
genome_262, genome_263, genome_264, genome_265, genome_266, genome_267, genome_268, genome_269, genome_270, genome_271, genome_273, genome_274, genome_275, genome_276, genome_277, genome_278, genome_279, genome_28, genome_280, genome_281, genome_282, genome_283, genome_285, genome_286, genome_289, genome_29, genome_290, genome_291, genome_292, genome_293, genome_294, genome_295, genome_296, genome_297, genome_298, genome_299, genome_3, genome_30, genome_300, genome_301, genome_303, genome_304, genome_305, genome_306, genome_307, genome_308, genome_309, genome_31, genome_310, genome_311, genome_312, genome_313, genome_314, genome_315, genome_316, genome_317, genome_318, genome_319, genome_32, genome_320, genome_321, genome_323, genome_324, genome_325, genome_326, genome_327, genome_328, genome_329, genome_33, genome_330, genome_331, genome_34, genome_35, genome_36, genome_37, genome_38, genome_39, genome_4, genome_40, genome_41, genome_42, genome_44, genome_45, genome_46, genome_47, genome_48, genome_49, genome_5, genome_50, genome_51, genome_52, genome_53, genome_54, genome_56, genome_57, genome_58, genome_59, genome_60, genome_62, genome_63, genome_64, genome_65, genome_66, genome_67, genome_68, genome_69, genome_7, genome_70, genome_71, genome_72, genome_73, genome_74, genome_75, genome_76, genome_77, genome_78, genome_79, genome_8, genome_80, genome_81, genome_82, genome_83, genome_84, genome_86, genome_87, genome_88, genome_89, genome_9, genome_90, genome_91, genome_92, genome_93, genome_94, genome_95, genome_96, genome_97, genome_98, genome_99, newman_127, usa300_111, usa300_138, usa300_149, usa300_152, usa300_156, usa300_160, usa300_173, usa300_179, usa300_182, usa300_195, usa300_197, usa300_237, usa300_26, usa300_27, usa300_272, usa300_284, usa300_287, usa300_288, usa300_302, usa300_322, usa300_43, usa300_55, usa300_6, usa300_61, usa300_85
Alignment sequence length ....................: 975
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Info .........................................: Ignored unknown character X (seen 576 times)
Refining topology ............................: 22 rounds ME-NNIs, 2 rounds ME-SPRs, 11 rounds ML-NNIs
Info .........................................: Total branch-length 0.078 after 0.18 sec
ML-NNI round 1 ...............................: LogLk = -3545.418 NNIs 15 max delta 5.85 Time 1.76
Info .........................................: Switched to using 20 rate categories (CAT approximation)
Info .........................................: Rate categories were divided by 0.645 so that average rate = 1.0
Info .........................................: CAT-based log-likelihoods may not be comparable across runs
Info .........................................: Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2 ...............................: LogLk = -3512.777 NNIs 7 max delta 0.00 Time 3.66
Info .........................................: Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3 ...............................: LogLk = -3512.777 NNIs 5 max delta 0.00 Time 4.85 (final)
Optimize all lengths .........................: LogLk = -3512.777 Time 5.20

FastTree output newick file ..................: /project2/meren/PROJECTS/T7SS/data/raw/20190718_saureus_genomes/02_CONTIGS/phylogenomic-tree.txt

Here are the alignment lengths of the two input files:

concatenated-proteins.fa: 179,052
concatenated-proteins_small.fa: 975

I also attempted to run MUSCLE and FastTree individually on my original concatenated-proteins.fa, but unfortunately could not get past the alignment step. I am not sure whether this is informative, but I wanted to include it just in case.

$ muscle -in ../concatenated-proteins.fa -out concatenated-proteins.msa

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

concatenated-proteins 328 seqs, lengths min 178383, max 178587, avg 178512
00:01:04   326 MB(17%)  Iter   1  100.00%  K-mer dist pass 1
00:01:04   326 MB(17%)  Iter   1  100.00%  K-mer dist pass 2
Killed04   436 MB(22%)  Iter   1    0.31%  Align node

Thank you for taking a look and please let me know if you want me to send you any of my files for reproducibility.

meren commented 4 years ago

Alignment sequence length ....................: 179,052

This is simply too many residues to consider. That's why Mahmoud has implemented functional homogeneity estimates per gene cluster, so you can choose only those gene clusters with meaningful variation (most of them will have functional homogeneity of 1.0, meaning that there is no variation across genes within them) and no alignment issues (i.e., geometric homogeneity > 0.95).
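
The selection logic described above can be sketched as a simple filter. This is only an illustration: the `GeneCluster` record, the cluster names, and the index values are made up and not read from a real pan database, and the functional cut-off of 0.95 is an assumed value (the comment above only says that 1.0 means no variation). In practice anvi'o applies such thresholds itself.

```python
from dataclasses import dataclass


@dataclass
class GeneCluster:
    """Hypothetical record holding the two homogeneity indices of a cluster."""
    name: str
    functional_homogeneity: float
    geometric_homogeneity: float


def select_informative(clusters, max_functional=0.95, min_geometric=0.95):
    """Keep clusters that vary across genomes but still align cleanly.

    max_functional is an assumed cut-off for "meaningful variation";
    min_geometric mirrors the "geometric homogeneity > 0.95" remark.
    """
    return [
        c for c in clusters
        if c.functional_homogeneity <= max_functional
        and c.geometric_homogeneity >= min_geometric
    ]


# Made-up example values, for illustration only.
clusters = [
    GeneCluster("GC_a", 1.00, 1.00),  # identical in every genome: no signal
    GeneCluster("GC_b", 0.90, 0.99),  # varies and aligns cleanly: keep
    GeneCluster("GC_c", 0.85, 0.70),  # varies but the alignment is gappy
]
selected = [c.name for c in select_informative(clusters)]
```

In this toy input only `GC_b` survives: `GC_a` carries no variation and `GC_c` fails the geometric threshold.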

mschecht commented 4 years ago

Thanks for the suggestions, @meren. I went back, filtered for a group of 70 SCGs using the combined homogeneity index, and was able to run anvi-gen-phylogenomic-tree successfully.

Sirbius commented 9 months ago

Hi guys, I'm using anvi'o v7 within the Docker container. I'm trying to build a phylogenetic tree from the concatenated single-copy core gene (SCG) sequences extracted with:

anvi-get-sequences-for-gene-clusters -g Xac-GENOMES.db -p XacPangenome/XacAnalysis-PAN.db --min-functional-homogeneity-index 1 --min-geometric-homogeneity-index 0.95 --min-num-genomes-gene-cluster-occurs 13 --max-num-genes-from-each-genome 1 --concatenate-gene-clusters -o SCG-H-filtered.fasta

But when I run

anvi-gen-phylogenomic-tree -f SCG-H-filtered.fasta -o SCG-H-tree

I still get the famous error from ETE:

Input aligment file path .....................: /home/silviat/Andrea/Xac_pangenome/SCG-H-filtered.fasta
Output file path .............................: /home/silviat/Andrea/Xac_pangenome/SCG-H-tree
Alignment names ..............................: Xac_301, Xac_A7, Xac_CFBP1159_INRA, Xac_CFBP1159_ZHAW, Xac_CFBP1846, Xac_CFBP2565, Xac_CFBP6600, Xac_IVIA3978, Xac_NCCB100457, Xac_XH2, Xac_XH3, Xac_XH7, Xac_XH8
Alignment sequence length ....................: 509,496
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for Xac_301 .......: expected 526407 but have 526397 instead.
Info .........................................: This sequence may be truncated, or another sequence may be too long.

File/Path Error: Your tree doesn't seem to be properly formatted. Here is what ETE had to say
about this: 'Unexisting tree file or Malformed newick tree structure. You may
want to check other newick loading flags like 'format' or 'quoted_node_names'.'. Pity :/

Of course, what it says about Xac_301 (expected 526407 but have 526397 instead) is not really true. Is this a problem caused by the large number of SCGs? I cannot go lower with the number of gene clusters since we are dealing with isolates of the same pathovar.
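
One way to double-check a "Wrong number of characters" message independently is to count the characters of every record in the FASTA file and list the records that differ from the most common length. The sketch below uses only the standard library; the function names and the tiny demo file are hypothetical and not part of anvi'o or FastTree.

```python
from collections import Counter
from pathlib import Path
import tempfile


def read_fasta_lengths(path):
    """Return {record name: number of sequence characters} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the record name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence lines may be wrapped, so accumulate them.
                lengths[name] += len(line)
    return lengths


def find_length_outliers(lengths):
    """Return (most common length, {name: length} of records that differ)."""
    if not lengths:
        return None, {}
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    outliers = {n: l for n, l in lengths.items() if l != expected}
    return expected, outliers


# Tiny made-up file: records a and b have 7 characters, c has only 6.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fa"
    demo.write_text(">a\nMKV-L\nAA\n>b\nMKVAL\nAA\n>c\nMKVAL\nA\n")
    demo_lengths = read_fasta_lengths(demo)
    demo_expected, demo_outliers = find_length_outliers(demo_lengths)
```

If the outlier dictionary comes back empty for the real file, every record has the same length and the cause of the message has to be looked for elsewhere.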

Any suggestions? Is there an alternative way to build a phylogenetic tree from the SCGs?

Thanks,
Silvia

meren commented 9 months ago

Hey @Sirbius, would you please consider using the Docker container for v8?

Plus,

Alignment sequence length ....................: 509,496

0.5 million residues is a little too much for any meaningful analysis, I think :) You should consider using these flags instead:

--min-geometric-homogeneity-index 1.0 --min-functional-homogeneity-index 0.95

Yours seems to be the opposite of best practice.

MrCorylus commented 9 months ago

Hello @meren, I’m working with @Sirbius. Could you please explain why you think that a phylogenomic tree built from such a long alignment is meaningless? We thought that the more genes we compare among strains, the more solid the analysis is. Are we wrong?

We tried to use the flags with the parameters you suggested, but we still obtain more than 3000 clusters, which is too much for Anvi’o to build a phylogenomic tree.

Since genes with 100% sequence identity across all the analyzed strains are not informative in a phylogenomic study, would it make sense to set --max-geometric-homogeneity-index 0.99 in order to exclude all clusters whose genes have the same sequence in every strain? Is that the purpose of the geometric homogeneity index? With this parameter we obtain only 265 clusters and are able to build a phylogenetic tree.

Thank you in advance,

Andrea

meren commented 9 months ago

Hi @MrCorylus, would you mind sharing with me a private download link for the PAN.db and genomes storage (GENOMES.db) via email so I can take a look at the data before making a suggestion?

meren commented 9 months ago

Hi again Andrea,

Thanks for the email.

We tried to use the flags with the parameters you suggested, but we still obtain more than 3000 clusters, which is too much for Anvi’o to build a phylogenomic tree.

I'm sorry, I can see that I've made a mistake in my suggestion. It should've been --max-functional-homogeneity-index 0.95 and not --min-functional-homogeneity-index 0.95. But when you correct for that you only get 2 gene clusters, which is not very useful. But using these parameters instead,

[image: screenshot of the parameters set in the interactive interface]

I was able to get 30 gene clusters, and was able to generate a tree:

[image: screenshot of the resulting tree]

I used the interactive interface for convenience, but you should be able to translate my parameters to the command line easily.

I hope this helps.

Best wishes, Meren