Closed ivagljiva closed 2 months ago
Okay, I just found a gut metagenome with a single profile database, and tested the --metagenomes
input to this program using both this branch and the master
branch. No difference in the output file, so I guess this is good to go :)
This looks great, Iva. Thank you very much for taking a stab at this.
I tested it with the cases available to me, and it worked perfectly, except one minor concern I had:
When we run the following, we get a report, and a text file,
anvi-estimate-scg-taxonomy -c E_faecalis_6240.db -o TAX.txt
Contigs DB ...................................: E_faecalis_6240.db
Metagenome mode ..............................: False
Estimated taxonomy for "E_faecalis_6240"
===============================================
+-----------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+
| | total_scgs | supporting_scgs | taxonomy |
+=================+==============+===================+===========================================================================================================+
| E_faecalis_6240 | 1 | 1 | Bacteria / Bacillota / Bacilli / Lactobacillales / Enterococcaceae / Enterococcus / Enterococcus faecalis |
+-----------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+
Output file ..................................: TAX.txt
Which looks like this:
$ cat TAX.txt
bin_name total_scgs supporting_scgs t_domain t_phylum t_class t_order t_family t_genus t_species
E_faecalis_6240 1 1 Bacteria Bacillota Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis
Which is an extremely convenient format (since, if you had multiple genomes from which you actually have generate a pangenome, you could literally import this file into a pan-db to show taxonomy as layer data since the bin_name
will be a unique key). But when I run this, of course it doesn't give me an output file for multiple genomes:
anvi-estimate-scg-taxonomy -e external-genomes.txt -o TAX.txt
Config Error: When using SCG taxonomy estimation in this mode, you must provide an output file
prefix rather than an output file path. Anvi'o will use your prefix and will
generate many files that start with that prefix but ends with different names
for each taxonomic level.
Just wanted to mention in case you would like to take a stab at this while your hands are dirty with this part of the code today :)
And regarding your other note: the difference between single and merged profile databases is a bit complicated, and if I remember correctly there was a very valid reason to not allow merged-profile-db files in metagenome-txt. But I may have to revisit that at some point. Thank you for bringing it up!
Thanks, Meren :)
Just wanted to mention in case you would like to take a stab at this
I did it :) Now when I run
anvi-estimate-scg-taxonomy -e input_txt_files/external_genomes.txt -o TAX.txt
I get the following output in TAX.txt
:
name total_scgs supporting_scgs t_domain t_phylum t_class t_order t_family t_genus t_species
TEST_ISOLATE 22 12 Bacteria Actinomycetota Actinomycetia Actinomycetales Bifidobacteriaceae Bifidobacterium Bifidobacterium adolescentis
Enterococcus_faecalis_6240 22 20 Bacteria Bacillota Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis
You can also get both output types at once, if you want:
anvi-estimate-scg-taxonomy -e input_txt_files/external_genomes.txt -o TAX.txt -O TAX
To implement it, I ended up basically recreating the existing function (only a bit simpler) that prints this output for the single-db case (TaxonomyEstimatorSingle.store_items_taxonomy_super_dict()
) within the multi estimator class. So it is a case of two functions in different places in the codebase that do basically the same thing, but unfortunately it was necessary, because utilizing the existing function would have required more extensive changes. The existing function currently requires a single contigs database (because it calls get_print_friendly_items_taxonomy_super_dict
to get the db name), so getting an instance of this class and just running that function doesn't work unless I were to start modifying TaxonomyEstimatorSingle
. So here we are :/
Could you please test the new output type with your data and let me know if it works as you want? :) thanks!
Just tested! Looks great! Thank you so much, Iva!!! :)
Tested and it works beautifully! Using 226 contigs db, the -O
flag takes ~30s and the -o
flag only takes 3s!
The improvement is massive compared to my old bash loop, which took 8min!
This PR addresses #1689, as requested by @FlorianTrigodet and the members of the Bigelow Institute's single-cell genomics workshop (two years in a row, sorry for the delay).
I added only
--external-genomes
input to the program, since internal genomes are much more tricky (you would need to write new functions to extract only the SCG hits within a given bin). But external genomes was easy to add; I simply copied the existing strategy for working with multiple metagenomes, which enforces certain flags and then initializes/runs an instance of theEstimatorSingle
class for each individual genome (and gathers the output afterwards).For external genomes,
--metagenome-mode
is enforced to be off, and we don't allow the use of profile db input (and downstream features that depend on profile dbs) since those doesn't get read by theGenomeDescriptions
class when you load an external genomes file.I tested on a small external genomes file containing one B. adolescentis genome and one E. faecalis genome, and it worked. Here is the species-level matrix output as evidence :)
Side note, when I tried to test on the Infant Gut Tutorial to make sure that I didn't break anything with the metagenomes file input, I got this cryptic error (
The codebase is not ready to handle this :(
). The comments in the codebase that this was because we supposedly don't allow the use of merged profile databases in the case of multiple metagenome input:(from line ~673 of
scg.py
)Which is weird, because you can definitely run this program on the Infant Gut Tutorial when you use the single db input option, despite the profile being merged. Confusing. Maybe @meren has some insight on whether this is something that needs to be fixed for the
--metagenomes-file
input case.Regardless, there is no sanity check in the class loading the metagenome descriptions that enforces single profile dbs, which means that everyone who tries this with multiple metagenomes and their merged profiles gets this very cryptic error. So I just updated the error text to be a bit more specific about what is going on. Still, I think that this should probably be changed to allow the use of merged profiles -- If you can do something with the single estimator class, you should be able to also do it with the multi-estimator class.
Anyway, currently hunting down a single profile db for a metagenome that I can use to make sure outputs are the same between the old version of the code and this branch.