merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

Allow external genomes input to anvi-estimate-scg-taxonomy #2249

Closed ivagljiva closed 2 months ago

ivagljiva commented 2 months ago

This PR addresses #1689, as requested by @FlorianTrigodet and the members of the Bigelow Institute's single-cell genomics workshop (two years in a row, sorry for the delay).

I added only --external-genomes input to the program, since internal genomes are much more tricky (you would need to write new functions to extract only the SCG hits within a given bin). But external genomes was easy to add; I simply copied the existing strategy for working with multiple metagenomes, which enforces certain flags and then initializes/runs an instance of the EstimatorSingle class for each individual genome (and gathers the output afterwards).

For external genomes, --metagenome-mode is enforced to be off, and we don't allow the use of profile db input (and downstream features that depend on profile dbs) since those doesn't get read by the GenomeDescriptions class when you load an external genomes file.

I tested on a small external genomes file containing one B. adolescentis genome and one E. faecalis genome, and it worked. Here is the species-level matrix output as evidence :)

taxon   E_faecalis_6240 isolate
Bifidobacterium adolescentis    0.00    1.00
Enterococcus faecalis   1.00    0.00

Side note, when I tried to test on the Infant Gut Tutorial to make sure that I didn't break anything with the metagenomes file input, I got this cryptic error (The codebase is not ready to handle this :(). The comments in the codebase that this was because we supposedly don't allow the use of merged profile databases in the case of multiple metagenome input:

(from line ~673 of scg.py)

# NOTE: what is happening down below might look stupid, because it really is. items of `d` here
            # contain a dictionary with the key 'coverages' where the coverage value of a given gene or bin
            # is kept per sample. to simplify downstream data wrangling operations, here we replace the
            # coverages dictionary with a single 'coverage': value entry to `d`. The
            # only reason we can do this here is becasue we only allow the user to use single profiles
            # with their metagenome descriptions and we are certain that there will be only a single entry in the
            # coverages dictionary.

Which is weird, because you can definitely run this program on the Infant Gut Tutorial when you use the single db input option, despite the profile being merged. Confusing. Maybe @meren has some insight on whether this is something that needs to be fixed for the --metagenomes-file input case.

Regardless, there is no sanity check in the class loading the metagenome descriptions that enforces single profile dbs, which means that everyone who tries this with multiple metagenomes and their merged profiles gets this very cryptic error. So I just updated the error text to be a bit more specific about what is going on. Still, I think that this should probably be changed to allow the use of merged profiles -- If you can do something with the single estimator class, you should be able to also do it with the multi-estimator class.

Anyway, currently hunting down a single profile db for a metagenome that I can use to make sure outputs are the same between the old version of the code and this branch.

ivagljiva commented 2 months ago

Okay, I just found a gut metagenome with a single profile database, and tested the --metagenomes input to this program using both this branch and the master branch. No difference in the output file, so I guess this is good to go :)

meren commented 2 months ago

This looks great, Iva. Thank you very much for taking a stab at this.

I tested it with the cases available to me, and it worked perfectly, except one minor concern I had:

When we run the following, we get a report, and a text file,

anvi-estimate-scg-taxonomy -c E_faecalis_6240.db -o TAX.txt
Contigs DB ...................................: E_faecalis_6240.db
Metagenome mode ..............................: False

Estimated taxonomy for "E_faecalis_6240"
===============================================
+-----------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+
|                 |   total_scgs |   supporting_scgs | taxonomy                                                                                                  |
+=================+==============+===================+===========================================================================================================+
| E_faecalis_6240 |            1 |                 1 | Bacteria / Bacillota / Bacilli / Lactobacillales / Enterococcaceae / Enterococcus / Enterococcus faecalis |
+-----------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+

Output file ..................................: TAX.txt

Which looks like this:

$ cat TAX.txt
bin_name    total_scgs  supporting_scgs t_domain    t_phylum    t_class t_order t_family    t_genus t_species
E_faecalis_6240 1   1   Bacteria    Bacillota   Bacilli Lactobacillales Enterococcaceae Enterococcus    Enterococcus faecalis

Which is an extremely convenient format (since, if you had multiple genomes from which you actually have generate a pangenome, you could literally import this file into a pan-db to show taxonomy as layer data since the bin_name will be a unique key). But when I run this, of course it doesn't give me an output file for multiple genomes:

anvi-estimate-scg-taxonomy -e external-genomes.txt -o TAX.txt

Config Error: When using SCG taxonomy estimation in this mode, you must provide an output file
              prefix rather than an output file path. Anvi'o will use your prefix and will
              generate many files that start with that prefix but ends with different names
              for each taxonomic level.

Just wanted to mention in case you would like to take a stab at this while your hands are dirty with this part of the code today :)

And regarding your other note: the difference between single and merged profile databases is a bit complicated, and if I remember correctly there was a very valid reason to not allow merged-profile-db files in metagenome-txt. But I may have to revisit that at some point. Thank you for bringing it up!

ivagljiva commented 2 months ago

Thanks, Meren :)

Just wanted to mention in case you would like to take a stab at this

I did it :) Now when I run

anvi-estimate-scg-taxonomy -e input_txt_files/external_genomes.txt -o TAX.txt

I get the following output in TAX.txt:

name    total_scgs  supporting_scgs t_domain    t_phylum    t_class t_order t_family    t_genus t_species
TEST_ISOLATE    22  12  Bacteria    Actinomycetota  Actinomycetia   Actinomycetales Bifidobacteriaceae  Bifidobacterium Bifidobacterium adolescentis
Enterococcus_faecalis_6240  22  20  Bacteria    Bacillota   Bacilli Lactobacillales Enterococcaceae Enterococcus    Enterococcus faecalis

You can also get both output types at once, if you want:

anvi-estimate-scg-taxonomy -e input_txt_files/external_genomes.txt -o TAX.txt -O TAX

To implement it, I ended up basically recreating the existing function (only a bit simpler) that prints this output for the single-db case (TaxonomyEstimatorSingle.store_items_taxonomy_super_dict()) within the multi estimator class. So it is a case of two functions in different places in the codebase that do basically the same thing, but unfortunately it was necessary, because utilizing the existing function would have required more extensive changes. The existing function currently requires a single contigs database (because it calls get_print_friendly_items_taxonomy_super_dict to get the db name), so getting an instance of this class and just running that function doesn't work unless I were to start modifying TaxonomyEstimatorSingle. So here we are :/

ivagljiva commented 2 months ago

Could you please test the new output type with your data and let me know if it works as you want? :) thanks!

meren commented 2 months ago

Just tested! Looks great! Thank you so much, Iva!!! :)

FlorianTrigodet commented 2 months ago

Tested and it works beautifully! Using 226 contigs db, the -O flag takes ~30s and the -o flag only takes 3s! The improvement is massive compared to my old bash loop, which took 8min!