Testing and improving the SCG taxonomy branch

meren commented 4 years ago

While this issue contains specific requests for @ozcan and @ekiefl down below, anyone who wish to test the new taxonomy branch can follow this recipe.

It will require you to switch to anvio-taxonomy branch, and download and prepare the following test data (alternatively, you can use your own data and update this issue if something catches your eye):

# make sure you are in the anvio-taxonomy branch:
git checkout anvio-taxonomy
git pull

# download and unpack the data
wget https://ndownloader.figshare.com/files/12898007 -O INFANT-GUT-TUTORIAL.tar.gz
tar -zxvf INFANT-GUT-TUTORIAL.tar.gz && cd INFANT-GUT-TUTORIAL

# migrate databases, run things that need to be run
anvi-migrate-db *db
anvi-run-hmms -c CONTIGS.db -T 6
anvi-run-scg-taxonomy -c CONTIGS.db -T 3 -P 3
anvi-import-collection -p PROFILE.db -c CONTIGS.db additional-files/collections/merens.txt -C default

Once you are here, you have everything ready: an up-to-date contigs database with taxonomy and a profile database with a collection. The following commands should work, and you should feel free to apply them to datasets you are working with.

Taxonomy for a contigs database:

anvi-estimate-taxonomy -c CONTIGS.db

Taxonomy for a contigs database in metagenome mode:

anvi-estimate-taxonomy -c CONTIGS.db \
                       --metagenome-mode

Taxonomy for each bin in a collection:

anvi-estimate-taxonomy -c CONTIGS.db \
                       -p PROFILE.db \
                       -C default

Taxonomy for a contigs database in metagenome mode with SCG coverages:

anvi-estimate-taxonomy -c CONTIGS.db \
                       -p PROFILE.db \
                       --metagenome-mode \
                       --compute-scg-coverages

@ozcan, I changed the bottle route to work with the new design properly. but since the data structure has changed in the process, the interface side needs to be fixed. I have multiple requests.

[x] it is important to make sure that this feature is only initialized when we are in the right mode. i.e., not in collection mode, not in pan mode, not in manual mode, etc.
[x] it shouldn't run whenever someone loads the interface with some default collection. let's add a checkbox in the Main tab right above "Show additional settings" (when the mode is appropriate) that says "Estimate taxonomy automatically". If checked, anvi'o estimates taxonomy every time the user makes a selection. It is deselected by default.
[x] The button in the Bins tab reads "Recalculate / Show Taxonomy for Bins" and opens a table of taxonomy and shows something similar to this:

@ekiefl, as we discussed before, everything should be ready for you to take a look at our heuristics. for the example above, you can use the following two commands to see how things respond to your changes to set a better consensus taxonomy:

# first level: setting consensus at the level of individual hits
anvi-run-scg-taxonomy -c CONTIGS.db --debug

# second level: setting consensus at the level of bins:
anvi-estimate-taxonomy -c CONTIGS.db \
                       -p PROFILE.db \
                       -C default \
                       --debug

@efogarty11 can send you additional contigs databases for more comprehensive tests if you like.

Thank you all very much for your help in advance.

@ozcan and @ekiefl, please let me know about your timeline. I think this would be a good addition to v6.

ekiefl commented 4 years ago

thanks @meren. I will attempt to finish something by the end of this week.

ozcan commented 4 years ago

I will do my part this week @meren

efogarty11 commented 4 years ago

Absolutely! I'll plan to come up with 5 high bacteroides and 5 low bacteroides @meren, unless there's something different you'd like :)

ivagljiva commented 4 years ago

I am pretty sure that this branch was not meant to be used in our pangenomics workflow. But I am curious, so I will try it.

I am testing this out on an SFB pangenome I made previously, in which I was checking a putative SFB MAG against 3 complete SFB genomes and also Clostridium perfringens. I'm sure the taxonomy estimation will work for the contigs databases of each of these, but I am fairly sure it will not show up when I run anvi-display-pan. But, if all goes well, I hope to see the three SFB genomes labeled as "Candidatus arthromitus" (the old designation) or "Candidatus savagella" (the new designation), and the C. perfringens and putative SFB bin labeled as "Clostridium perfringens".

Here is my working directory on my laptop: /Users/iva/Lab/merenlab/SFB/SFB_BIN_PANGENOME I switched to the taxonomy branch and migrated all my DBs like so:

anvi-migrate-db *.db
anvi-migrate-db WITH_C_PERFRINGENS/WITH_C_PERFRINGENS-PAN.db
anvi-migrate-db 02_CONTIGS/*.db

Running HMMs on all my genomes and the bin:

for DB in $(ls 02_CONTIGS/); do anvi-run-hmms -c 02_CONTIGS/$DB -T 6; done

THIS IS SO FAST NOW! THANK YOU, OZCAN!

Running single-copy core gene taxonomy on each one:

for DB in $(ls 02_CONTIGS/); do anvi-run-scg-taxonomy -c 02_CONTIGS/$DB -T 3 -P 3; done

I estimated taxonomy:

for DB in $(ls 02_CONTIGS/); do anvi-estimate-taxonomy -c 02_CONTIGS/$DB; done

It appears to be working beautifully. I see these tables come up, like this:

Estimated taxonomy for "GCA_001655775"
===============================================
╒═══════════════╤══════════════╤═══════════════════╤════════════════════════════════════════════════════════════════════════════════════╕
│               │   total_scgs │   supporting_scgs │ taxonomy                                                                           │
╞═══════════════╪══════════════╪═══════════════════╪════════════════════════════════════════════════════════════════════════════════════╡
│ GCA_001655775 │           22 │                22 │ Bacteria / Firmicutes_A / Clostridia / Clostridiales / Savagellaceae / Savagella / │
╘═══════════════╧══════════════╧═══════════════════╧════════════════════════════════════════════════════════════════════════════════════╛
Estimated taxonomy for "Putative_SFB"
===============================================
╒══════════════╤══════════════╤═══════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│              │   total_scgs │   supporting_scgs │ taxonomy                                                                                                          │
╞══════════════╪══════════════╪═══════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Putative_SFB │           22 │                22 │ Bacteria / Firmicutes_A / Clostridia / Clostridiales / Clostridiaceae / Clostridium_P / Clostridium_P perfringens │
╘══════════════╧══════════════╧═══════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

The taxonomy is being estimated correctly. In fact, it corroborates my hypothesis that the putative SFB bin is Clostridium perfringens. :)

According to Ozcan, I would have to re-make my genomes storage DB to see the taxonomy when I display my pangenome. I can do that:

anvi-gen-genomes-storage -e external-genomes.txt -o SFB_BIN_CHECK_TAXONOMY-GENOMES.db
anvi-pan-genome -g SFB_BIN_CHECK_TAXONOMY-GENOMES.db -n SFB_BIN_CHECK_TAXONOMY

That all works perfectly. However, if I try to display it, I get an error:

iva$ anvi-display-pan -g SFB_BIN_CHECK_TAXONOMY -p SFB_BIN_CHECK_TAXONOMY/SFB_BIN_CHECK_TAXONOMY-PAN.db 
Interactive mode .............................: pan                                                                                                                                  
Traceback (most recent call last):                                                                                                                                                   
  File "/Users/iva/software/anvio/bin/anvi-display-pan", line 76, in <module>
    d = interactive.Interactive(args)
  File "/Users/iva/software/anvio/anvio/interactive.py", line 209, in __init__
    self.load_pan_mode()
  File "/Users/iva/software/anvio/anvio/interactive.py", line 904, in load_pan_mode
    PanSuperclass.__init__(self, self.args)
  File "/Users/iva/software/anvio/anvio/dbops.py", line 971, in __init__
    progress=self.progress)
  File "/Users/iva/software/anvio/anvio/genomestorage.py", line 54, in __init__
    self.db = db.DB(self.storage_path, self.version, new_database=create_new)
  File "/Users/iva/software/anvio/anvio/db.py", line 46, in __init__
    self.conn = sqlite3.connect(self.db_path)
sqlite3.OperationalError: unable to open database file

So here's what I've learned:

the taxonomy estimation commands work perfectly on contigs DBs of single genomes / of a single bin. And this is quite fast, too.
if you estimate taxonomy on contigs DBs that are then used to generate a pangenome, this does not fail until you try to display the pangenome.

I think this second point hearkens to Meren's note above:

it is important to make sure that this feature is only initialized when we are in the right mode. i.e., not in collection mode, not in pan mode, not in manual mode, etc.

But I am very pleasantly surprised by the speed and accuracy at which this estimated the taxonomy of my genomes. :)

meren commented 4 years ago

New feature: taxonomy for layers

With the commits above I just added a new functionality to add taxonomy layers to profile databases.

Previously we could do this:

anvi-estimate-taxonomy -c CONTIGS.db \
                       -p PROFILE.db \
                       --metagenome-mode \
                       --compute-scg-coverages

and get back something like this:

Now, we have an additional flag that adds this information to the profile database in a form that can be visualized through the interactive interface. Here is an example:

anvi-estimate-taxonomy -c CONTIGS.db \
                       -p PROFILE.db \
                       --metagenome-mode \
                       --compute-scg-coverages \
                       --update-profile-db-with-taxonomy

When the user runs this, anvi'o uses the most frequent SCG in the contigs database to calculate everything and add the information to the profile database. In this case with the Infant Gut Dataset anvi'o happens to choose Ribosomal_S6. Let's also run it with another SCG we choose before visualizing things:

anvi-estimate-taxonomy -c CONTIGS.db \
                       -p PROFILE.db \
                       --metagenome-mode \
                       --compute-scg-coverages \
                       --update-profile-db-with-taxonomy \
                       --scg-name-for-metagenome-mode Ribosomal_S8

Fine. Now running the anvi-interactive:

anvi-interactive -p PROFILE.db -c CONTIGS.db

And there is a surprise in the layers panel:

And if you were to choose two:

This is what you see on the display:

Colors don't match because they are randomly assigned, but you can tell what is what since the proportions are pretty similar.

🐡

meren commented 4 years ago

@ozcan, this doesn't look good since the taxonomy text does not fit into that cell:

is it possible to do add it as another tr>td with a colspan of 6? or can we add it as a div underneath the tr?

ozcan commented 4 years ago

Great suggestion @meren, Yes indeed expanding it to the entire row would look much better

meren commented 4 years ago

Another quick point; when there is no scg taxonomy instance, the interface could avoid showing anything:

ekiefl commented 4 years ago

I get this error following your workflow:

(anvio-master) ▶▶ anvi-run-scg-taxonomy -c CONTIGS.db -T 3 -P 3
Num relevant SCGs in contigs db ..............: 212
Taxonomy .....................................: /Users/evan/Software/anvio/anvio/data/misc/SCG_TAXONOMY/GTDB/ACCESSION_TO_TAXONOMY.txt
Database reference ...........................: /Users/evan/Software/anvio/anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES
Number of SCGs ...............................: 22

Parameters for DIAMOND blastp
===============================================
Max number of target sequences ...............: 20
Min bit score to report alignments ...........: 90
Num aligment tasks running in parallel .......: 3
Num CPUs per aligment task ...................: 3

WARNING
===============================================
Contents of the table "scg_taxonomy" is being removed

[04 Oct 19 11:24:51 Computing SCGs aligments] Initializing 3 process...                                                                                              ETA: ∞:∞:∞Process Process-3:
Process Process-2:
Process Process-4:
Traceback (most recent call last):
  File "/Users/evan/.pyenv/versions/3.6.1/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/evan/.pyenv/versions/3.6.1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/evan/Software/anvio/anvio/scgtaxonomyops.py", line 1319, in blast_search_scgs_worker
    gene_callers_id = int(fields[0])
ValueError: invalid literal for int() with base 10: 'Error: Database was built with a different version of diamond as is incompatible.'
Traceback (most recent call last):
  File "/Users/evan/.pyenv/versions/3.6.1/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/evan/.pyenv/versions/3.6.1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/evan/Software/anvio/anvio/scgtaxonomyops.py", line 1319, in blast_search_scgs_worker
    gene_callers_id = int(fields[0])
ValueError: invalid literal for int() with base 10: 'Error: Database was built with a different version of diamond as is incompatible.'
Traceback (most recent call last):
  File "/Users/evan/.pyenv/versions/3.6.1/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/evan/.pyenv/versions/3.6.1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/evan/Software/anvio/anvio/scgtaxonomyops.py", line 1319, in blast_search_scgs_worker
    gene_callers_id = int(fields[0])
ValueError: invalid literal for int() with base 10: 'Error: Database was built with a different version of ()diamond as is incompatible.'

meren commented 4 years ago

@ekiefl, I am almost thinking about keeping a diamond binary in the database to avoid this. can someone please stop me? :(

meren commented 4 years ago

I am almost thinking about keeping a diamond binary in the database to avoid this. can someone please stop me? :(

@ozcan did stop me. of course we can't put a binary diamond in the codebase unless we can do it for every single architecture. stupid me. our convo revealed another idea, though:

[x] we will only ship the FASTA files, and not the binary database files, and the databases are going to be generated at the user side with the diamond installed on their system.

meren commented 4 years ago

we will only ship the FASTA files, and not the binary database files, and the databases are going to be generated at the user side with the diamond installed on their system.

This is now done.

meren commented 4 years ago

Ok. With with the previous commit, which builds on what @ozcan had started on Friday, the taxonomy window is ready as far as I am concerned.

The original goal was to recapitulate this in the interface:

And this is how it looks now:

But even better, when you click on one of the bin names, booyakasha:

If there are no protests, I will merge this to master very soon.

meren commented 4 years ago

Hey @ozcan, there are two minor things remain to finalize the taxonomy branch :) The first one is this:

[x] When the collection autoloads, there are no completion / redundancy estimates except the first one. If you click the button for taxonomy estimations everything goes to normal.
[x] The second thing is in the Taxonomy table (the last screenshot in the previous message) is sortable, but it gives a funny error when you click the headers. There is a single-line solution on the interwebs, but I didn't now exactly where to put it in the file :(
[ ] When there is no auto loaded collection, the "show taxonomy table" button remains disabled even when "estimate taxonomy" checkbox is selected.

Thank you very much!

meren commented 4 years ago

OK. I was wrong. There are a couple more things that need to be taken care of.

[x] When you select too many contigs from many populations, the terminal says "Config Error: You've hit some uncharted area", but the interface still puts a taxon name underneath the bin. It shouldn't

merenlab / anvio

Testing and improving the SCG taxonomy branch #1240

New feature: taxonomy for layers