merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Updating the databases for anvi'o SCG taxonomy #2210

Closed: meren closed this 5 months ago

meren commented 5 months ago

Not only updating the databases, but also the way we're dealing with them.

More about this in #2211.

mschecht commented 5 months ago

Ran this to confirm metagenome-mode is working correctly with this commit:

cd INFANT-GUT-TUTORIAL/additional-files/pangenomics

# Test on a single contigs-db. This should throw an error because it doesn't have the new data
$ anvi-estimate-scg-taxonomy -c external-genomes/Enterococcus_faecalis_6255.db   --metagenome-mode --scg-name-for-metagenome-mode Ribosomal_S15 -o asdf
Config Error: The SCG taxonomy database version on your computer (GTDB: v214.1; Anvi'o: v1) is
              different than the SCG taxonomy database version to populate your contigs
              database (v95). Please re-run the program `anvi-run-scg-taxonomy` on your
              contigs-db.

# Test on multiple contigs-dbs
head -n 4 external-genomes.txt > external-genomes-small.txt

# This should throw an error
$ anvi-estimate-scg-taxonomy -M external-genomes-small.txt  --metagenome-mode --scg-name-for-metagenome-mode Ribosomal_S15 -O asdf
Config Error: The SCG taxonomy database version on your computer (GTDB: v214.1; Anvi'o: v1) is
              different than the SCG taxonomy database version to populate your contigs
              database (v95). Please re-run the program `anvi-run-scg-taxonomy` on your
              contigs-db.

# update one of the contigs-dbs and see what happens
anvi-run-scg-taxonomy -c external-genomes/Enterococcus_faecalis_6240.db -T 5

# Still catches the error woohoo! :)
$ anvi-estimate-scg-taxonomy -M external-genomes-small.txt  --metagenome-mode --scg-name-for-metagenome-mode Ribosomal_S15 -O asdf
Config Error: The SCG taxonomy database version on your computer (GTDB: v214.1; Anvi'o: v1) is
              different than the SCG taxonomy database version to populate your contigs
              database (v95). Please re-run the program `anvi-run-scg-taxonomy` on your
              contigs-db.

# Update them all and see what happens

for genome in $(tail -n +2 external-genomes-small.txt | cut -f 2); do anvi-run-scg-taxonomy -c "$genome" -T 6; done

$ anvi-estimate-scg-taxonomy -M external-genomes-small.txt  --metagenome-mode --scg-name-for-metagenome-mode Ribosomal_S15 -O asdf
Num metagenomes ..............................: 3
Taxonomic level of interest ..................: (None specified by the user, so 'all levels')
Output file prefix ...........................: asdf
Output in matrix format ......................: False
Output raw data ..............................: False
SCG coverages will be computed? ..............: False
SCG [chosen by the user] .....................: Ribosomal_S15

* Your metagenome file DOES NOT contain profile databases, but you asked anvi'o to
  estimate SCG taxonomy in metagenome mode. So be it. SCG name is set to
  Ribosomal_S15.

Long-format output ...........................: asdf-LONG-FORMAT.txt
# and it works!
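
The behavior tested above, where a single outdated contigs-db in the collection is enough to stop the run, can be sketched roughly as follows. This is not anvi'o's actual code: the function name `find_outdated_dbs`, the plain dict of versions, and the placeholder file names are assumptions made for illustration. Only the version strings are taken from the error message above.

```python
def find_outdated_dbs(installed_version, db_versions):
    """Return the contigs-db paths whose recorded SCG taxonomy version
    differs from the installed one (hypothetical helper, not an anvi'o API)."""
    return sorted(path for path, version in db_versions.items()
                  if version != installed_version)


# Hypothetical state after updating only one of three contigs-dbs
installed = "v214.1"
db_versions = {
    "genome_A.db": "v214.1",  # anvi-run-scg-taxonomy was re-run on this one
    "genome_B.db": "v95",
    "genome_C.db": "v95",
}

outdated = find_outdated_dbs(installed, db_versions)
if outdated:
    # One stale database is enough to refuse the whole run
    print(f"{len(outdated)} contigs-db(s) need anvi-run-scg-taxonomy: {outdated}")
```

Checking every database before estimating anything is what makes the third test fail even though one contigs-db was already updated.
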

ivagljiva commented 5 months ago

I also tested on the Infant Gut Dataset (the main assembly in metagenome mode), as well as one of my single contigs db test files, and can confirm it works perfectly :)

meren commented 5 months ago

Thank you for catching that bug, @mschecht, and thank you for testing it further, @ivagljiva.

I am merging it now and we will deal with the fallout in master :p

ivagljiva commented 5 months ago

A friendly user on Discord identified a bug with this PR that I have also confirmed in my installation.

If you run anvi-setup-scg-taxonomy --reset, the program deletes the directory which contains the new SCG search databases:

if os.path.exists(self.ctx.SCGs_taxonomy_data_dir):
    if self.reset:
        shutil.rmtree(self.ctx.SCGs_taxonomy_data_dir)
        self.run.warning('The existing directory for SCG taxonomy data dir has been removed. Just so you know.')
        filesnpaths.gen_output_directory(self.ctx.SCGs_taxonomy_data_dir)

It used to be that we used --reset to download sequences directly from GTDB, but now the search databases ship with anvi'o, so we don't need the --reset functionality at all anymore. In fact, there were only three references to the self.reset variable remaining in the scg.py code: 1) reading the argument, 2) a sanity check against using both --reset and --redo-databases, and 3) the above code for deleting the directory.
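To make the failure mode concrete, here is a minimal, self-contained sketch of why deleting the directory is harmful once the search databases ship with the package. It is an illustration under assumptions, not the scg.py code: the function names, the `SCG_TAXONOMY` directory name, and the `Ribosomal_S15.dmnd` file name are hypothetical, and `os.makedirs` stands in for `filesnpaths.gen_output_directory`.

```python
import os
import shutil
import tempfile


def setup_with_reset(data_dir, reset=False):
    """Mimic the reported behavior: reset removes the data dir and recreates it empty."""
    if os.path.exists(data_dir) and reset:
        shutil.rmtree(data_dir)
    os.makedirs(data_dir, exist_ok=True)
    return sorted(os.listdir(data_dir))


def setup_without_reset(data_dir):
    """Mimic the proposed fix: there is no reset path, so shipped files stay in place."""
    os.makedirs(data_dir, exist_ok=True)
    return sorted(os.listdir(data_dir))


with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "SCG_TAXONOMY")  # hypothetical directory name
    os.makedirs(data_dir)
    # Empty stand-in for a search database that ships with the package
    open(os.path.join(data_dir, "Ribosomal_S15.dmnd"), "w").close()

    kept = setup_without_reset(data_dir)
    lost = setup_with_reset(data_dir, reset=True)

print(kept)  # ['Ribosomal_S15.dmnd']
print(lost)  # []
```

With nothing left to re-download, the reset path can only destroy data, which is the argument for dropping the option rather than patching it.
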

I think if we remove the --reset option entirely, this will be resolved. @meren, is there any reason to keep --reset? I couldn't find any other classes being called that might use this parameter, but I may have missed something.

If not, I have a commit ready to go to fix this bug :)

ivagljiva commented 5 months ago

See 4eab829195700e3c4fe6964b876151a778b33344 and 06a5769a0478ef8e7003c073daaa36660039fd75 for the fixes

ivagljiva commented 5 months ago

Furthermore, it seems the --redo-databases parameter is no longer used, so I got rid of it too :) (commit 2c1a6af3bf22d2199ae18e835058af7211fef7b0)

ivagljiva commented 4 months ago

(The additional commits are now merged to master as of 71818e1bc57d3d810604012df9aabe7111b02597.)