merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
415 stars 142 forks source link

[BUG] Can't create phylogenomic tree, as phylogenomic-tree.txt and default collection has different number of items. #2292

Closed CarolineSchouNielsen closed 2 weeks ago

CarolineSchouNielsen commented 2 weeks ago

Phylogenomic-tree.txt and default collection has different number of items

Anvi'o version

anvi-self-test --version

`Anvi'o .......................................: marie (v8) Python .......................................: 3.10.14

Profile database .............................: 38 Contigs database .............................: 21 Pan database .................................: 16 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 4 tRNA-seq database ............................: 2`

System info

I installed Anvi'o with the following commands (probably a lot more information than needed, but I am unsure of what to write as I did not make these installation instructions myself):

Screenshot 2024-07-03 at 12 42 45

Detailed description of the issue

Hi!

I was following your A primer on Anvi'o tutorial (with the infant gut data), but at Chapter 3: phylogenomics, I am having trouble visualising the phylogenomic tree with the additional information set in the default collection.

I made the default collection with the following command:

anvi-get-sequences-for-hmm-hits -c CONTIGS.db \ -p PROFILE.db \ -o seqs-for-phylogenomics.fa \ --hmm-source Bacteria_71 \ -C default \ --gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \ --concatenate-genes \ --return-best-hit \ --get-aa-sequences

Which gave this output:

Screenshot 2024-07-03 at 12 36 52

Then, I try to visualise the tree with this command:

anvi-interactive -p PROFILE.db \ -c CONTIGS.db \ -C default \ --tree phylogenomic-tree.txt

But instead of getting the tree, I get this error message, saying that the number of items in my collection and tree have to match:

Screenshot 2024-07-03 at 12 39 19

How and which items do I remove to make the tree?

Thank you! :)

ivagljiva commented 2 weeks ago

Since you are using a collection of 13 genomes for extracting the HMM sequences to make the tree, in theory your tree file should have 13 items in it, unless some of the genomes dropped out while filtering. You can request genomes to drop out explicitly using the --max-num-genes-missing-from-bin parameter, but here it seems to have happened without that request (and without telling you about it).

I tested this part of the tutorial on my own computer with anvi'o v8, and I was able to reproduce the error. The output of anvi-get-sequences-for-hmm-hits contains only sequences for 11 genomes in the collection:

$ grep -c '>' seqs-for-phylogenomics.fa
11

So two of the bins are dropping out. Cross-checking the list of bin names in the default collection with those in the sequence headers of seqs-for-phylogenomics.fa reveals that P_acnes and S_lugdunensis are missing from the latter.

Then I tried the same thing using anvio-dev. And this time, I got a nice warning output from the anvi-get-sequences-for-hmm-hits program:

Num bins that lacked the 6 genes in `--gene-names` : 2

Bins that are no more in the analysis ........: P_acnes, S_lugdunensis

Filtered hits ................................: 64 hits remain after filtering for 6 gene(s)

WARNING
===============================================
You requested only the best hits to be reported, which means, if, say, there are
more than one RecA hits in a bin for a given HMM source, only the one with the
lowest e-value will be kept, and others will be removed from your final results.

Filtered hits ................................: 56 hits remain after removing weak hits for multiple genes

CONDOLENCES FROM DAAN TO YOU FOR YOUR BINS THAT WENT 💀
===============================================
Please note that a total of 2 bins from your final results were removed while
anvi'o was applying all the filters to your HMM hits. Please carefully review
the logs above to make sure you have a good grasp on what happened, and you're
happy with the reults.

So the same thing is happening there, but at least in the development version we are told about what is going on :) Those two bins don't have any of the 6 genes that we asked for, so rather than representing them in the output file with sequences exclusively made up of gap characters, anvi'o removed them completely. Thus, the tree doesn't contain the 2 genomes, and trying to visualize your collection of 13 genomes with a tree of only 11 genomes fails.

I think the behavior of this program is reasonable, so it is not a bug. Rather, the Infant Gut Tutorial should be updated to reflect what is happening here. In fact, if you look carefully at the screenshots of the tree in the IGT, it has only 11 genomes in it.

To get around the error and visualize that tree as shown in the IGT, you need to remove from the collection the two bins that are missing from the tree:

anvi-export-collection -p PROFILE.db -C default
grep -v P_acnes collection-default.txt | grep -v S_lugdunensis > collection-default-reduced.txt
anvi-import-collection -c CONTIGS.db -p PROFILE.db -C default_reduced collection-default-reduced.txt

And then you can visualize the tree with the reduced set of genomes:

anvi-interactive -p PROFILE.db \
                 -c CONTIGS.db \
                 -C default_reduced \
                 --tree phylogenomic-tree.txt

Voila:

image

I will update the IGT to include these steps. Thanks for bringing this to our attention, @CarolineSchouNielsen.

ivagljiva commented 2 weeks ago

Tutorial has been updated :)

CarolineSchouNielsen commented 2 weeks ago

Thank you @ivagljiva! It works for me too :)