merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
432 stars 145 forks source link

[BUG] anvi-estimate-scg-taxonomy crashes when external-genomes file contains contigs-db with old scg names #2225

Closed mschecht closed 7 months ago

mschecht commented 7 months ago

Short description of the problem

anvi-estimate-scg-taxonomy crashes when external-genomes file contains contigs-db with old scg names.

anvi'o version

$ anvi-self-test --version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Profile database .............................: 40
Contigs database .............................: 22
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

macOS 13.2.1 (22D68)

Detailed description of the issue

Hi anvi'o team,

I am running anvi-estimate-scg-taxonomy with an external genomes file and a specific SCG. Everything is excellent until the program runs into a contigs-db that contains SCGs from the older version of the program.

I dove into the code and here is the issue: contigs-dbs make it this far into the code because they have the updated scg-taxonomy self attribute "GTDB: v214.1; Anvi'o: v1" even though anvi-run-scg-taxonomy. I think there should be a sanity check while processing the external-genomes.txt to skip a contigs-db that does not have target SCG needed for the anvi-estimate-scg-taxonomy.

I made a quick test below to demonstrate the issue.

Cheers, Matt

Files / commands to reproduce the issue

Here is a link to download these contigs-dbs: https://drive.google.com/drive/folders/1vY59vrhhW69TH-43W0MV8kBMe_Wfy9Jd?usp=sharing

cat <<EOF > fake-external-genomes.txt
name    contigs_db_path
test_1  ORAL_P-B-F_Bin_00093.db
test_2  ORAL_P-C-M_Bin_00098.db
test_3  ORAL_T-D-F_MAG_00012.db
EOF

$ anvi-estimate-scg-taxonomy -M fake-external-genomes.txt \
                           --metagenome-mode \
                           --scg-name-for-metagenome-mode Ribosomal_L19 \
                           --raw-output \
                           -O output_txt.tsv

Traceback (most recent call last):
  File "/Users/mschechter/github/anvio/bin/anvi-estimate-scg-taxonomy", line 104, in <module>
    main(args)
  File "/Users/mschechter/github/anvio/anvio/terminal.py", line 915, in wrapper
    program_method(*args, **kwargs)
  File "/Users/mschechter/github/anvio/bin/anvi-estimate-scg-taxonomy", line 39, in main
    t.estimate()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/scg.py", line 447, in estimate
    scg_taxonomy_super_dict_multi = self.get_scg_taxonomy_super_dict_for_metagenomes()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/scg.py", line 937, in get_scg_taxonomy_super_dict_for_metagenomes
    scg_taxonomy_super_dict[metagenome_name] = SCGTaxonomyEstimatorSingle(args, progress=progress_quiet, run=run_quiet).get_items_taxonomy_super_dict()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/scg.py", line 999, in __init__
    TaxonomyEstimatorSingle.__init__(self, skip_init=skip_init)
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/__init__.py", line 175, in __init__
    self.init()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/__init__.py", line 179, in init
    self.init_items_data()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/__init__.py", line 271, in init_items_data
    self.item_name_to_gene_caller_id_dict[item_gene_name].add(gene_callers_id)
KeyError: 'Ribosomal_L9_C'
meren commented 7 months ago

I think the real solution is to increase the version of contigs-db, and write a migration script that simply removes the SCG taxonomy results :/ it will be annoying to anyone who has been using the main branch, but it will be the most reliable way to fix all future hiccups.

mschecht commented 7 months ago

Good call - does there need to be multiple migration tasks or is this enough to write a new migration script?

meren commented 7 months ago

Your intuition was right, and a new migration script along with an update in version number was enough. I've made some changes in your version now, please test it, and feel free to migrate it when you're ready :)

mschecht commented 7 months ago

Thanks for the code updates @meren!

Here is the PR: #2226

Here is a successful test I ran with IGD:

cd INFANT-GUT-TUTORIAL

# migrate all external-genomes
for db in `ls additional-files/pangenomics/external-genomes*.db`; do anvi-migrate $db --migrate-safely; done

# run scg-taxonomy
anvi-run-scg-taxonomy -c additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6512.db --num-threads 1

# print table updates
query="SELECT * FROM scg_taxonomy;"
$ sqlite3 additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6512.db "$query"
2066|Ribosomal_L1|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
2464|Ribosomal_L13|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
156|Ribosomal_L14|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
153|Ribosomal_L16|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|
174|Ribosomal_L17|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
1556|Ribosomal_L19|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|
633|Ribosomal_L20|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
692|Ribosomal_L21p|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
151|Ribosomal_L22|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
158|Ribosomal_L5|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
149|Ribosomal_L2|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
166|Ribosomal_L27A|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
146|Ribosomal_L3|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
147|Ribosomal_L4|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
172|Ribosomal_S11|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
2321|Ribosomal_S15|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
1400|Ribosomal_S16|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
1818|Ribosomal_S2|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
6|Ribosomal_S6|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
140|Ribosomal_S7|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
160|Ribosomal_S8|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis
2463|Ribosomal_S9|CONSENSUS|100.0|Bacteria|Bacillota|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis

# Get updated values
db_file="additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6512.db"
key_to_filter="scg_taxonomy_was_run"
query="SELECT * FROM self WHERE key = '$key_to_filter';"
$ sqlite3 "$db_file" "$query"
scg_taxonomy_was_run|1

key_to_filter="scg_taxonomy_database_version"
query="SELECT * FROM self WHERE key = '$key_to_filter';"
$ sqlite3 "$db_file" "$query"
scg_taxonomy_database_version|GTDB: v214.1; Anvi'o: v1

Are self values look like they are updating properly :)

mschecht commented 7 months ago

Ok I found an edge case!

Here is an example of the problem:

test_db.zip

unzip test_db.zip

cd test_db

anvi-script-gen-genomes-file --input-dir . -o external-genomes.txt

$ anvi-estimate-scg-taxonomy -M external-genomes.txt \
                           --metagenome-mode \
                           --scg-name-for-metagenome-mode Ribosomal_L19 \
                           --raw-output \
                           -O asdf

Traceback (most recent call last):
  File "/Users/mschechter/github/anvio/bin/anvi-estimate-scg-taxonomy", line 104, in <module>
    main(args)
  File "/Users/mschechter/github/anvio/anvio/terminal.py", line 915, in wrapper
    program_method(*args, **kwargs)
  File "/Users/mschechter/github/anvio/bin/anvi-estimate-scg-taxonomy", line 39, in main
    t.estimate()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/scg.py", line 447, in estimate
    scg_taxonomy_super_dict_multi = self.get_scg_taxonomy_super_dict_for_metagenomes()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/scg.py", line 937, in get_scg_taxonomy_super_dict_for_metagenomes
    scg_taxonomy_super_dict[metagenome_name] = SCGTaxonomyEstimatorSingle(args, progress=progress_quiet, run=run_quiet).get_items_taxonomy_super_dict()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/scg.py", line 999, in __init__
    TaxonomyEstimatorSingle.__init__(self, skip_init=skip_init)
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/__init__.py", line 175, in __init__
    self.init()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/__init__.py", line 179, in init
    self.init_items_data()
  File "/Users/mschechter/github/anvio/anvio/taxonomyops/__init__.py", line 269, in init_items_data
    self.item_name_to_gene_caller_id_dict[item_gene_name].add(gene_callers_id)
KeyError: 'Ribosomal_L9_C'

I think this problem is caused by if you run [anvi-run-scg-taxonomy]() on a contigs-db that does not contain any of the new list of SCGs it will exit out as shown here.

Here are two solutions and I don't know which way to go considering we already merged the migration script:

  1. The migration needs to clear all scg_taxonomy tables regardless of whether the new version was ran.
  2. [anvi-run-scg-taxonomy]() should also remove any contents that might still be in the scg_taxonomy table IF it doesn't find any new SCG here
meren commented 7 months ago

This should have never happened, @mschecht. Could this be a contigs-db you migrated with the previous version of the migration script?

It has the following SCG in the table:

image

Even though Ribosomal_L9_C is not one of the SCGs we're using:

default_scgs_for_taxonomy = ['Ribosomal_L1',
                             'Ribosomal_L13',
                             'Ribosomal_L14',
                             'Ribosomal_L16',
                             'Ribosomal_L17',
                             'Ribosomal_L19',
                             'Ribosomal_L2',
                             'Ribosomal_L20',
                             'Ribosomal_L21p',
                             'Ribosomal_L22',
                             'Ribosomal_L27A',
                             'Ribosomal_L3',
                             'Ribosomal_L4',
                             'Ribosomal_L5',
                             'Ribosomal_S11',
                             'Ribosomal_S15',
                             'Ribosomal_S16',
                             'Ribosomal_S2',
                             'Ribosomal_S6',
                             'Ribosomal_S7',
                             'Ribosomal_S8',
                             'Ribosomal_S9']

AND even though the version of the DB shows that it is new:

scg_taxonomy_database_version ................: GTDB: v214.1; Anvi'o: v1

When I manually change the self table to revert this contigs-db back to v22 and set the scg_taxonomy_database_version to v214.1, then everything works out as expected:

:: anvi'o v7 dev ::  ~/Downloads/test_db >>> anvi-migrate ORAL_P-D-F_Bin_00037.db --migrate-quickly

NEW MIGRATION TASK
===============================================
Input file path ..............................: ORAL_P-D-F_Bin_00037.db
Input file type ..............................: contigs
Current Version ..............................: 22
Target Version ...............................: 23
Migration mode ...............................: Adventurous

SQLite Version ...............................: 3.41.2

* Your contigs database is now version 23. Sadly this update removed all SCG
  taxonomy data in this contigs-db due to a change in the set of SCGs anvi'o now
  uses for taxonomy estimation. As a result, you will need to re-run anvi-run-
  scg-taxonomy command on this contigs-db :/ If you would like to learn why this
  was necessary, please visit https://github.com/merenlab/anvio/issues/2211. We
  thank you for your patience!

 :: anvi'o v7 dev ::  ~/Downloads/test_db >>> anvi-estimate-scg-taxonomy -c ORAL_P-D-F_Bin_00037.db

Config Error: It seems the SCG taxonomy tables were not populated in this contigs database :/
              Luckily it is easy to fix that. Please see the program `anvi-run-scg-taxonomy`.

 :: anvi'o v7 dev ::  ~/Downloads/test_db >>> anvi-run-scg-taxonomy -c ORAL_P-D-F_Bin_00037.db

WARNING
===============================================
This contigs database contains no single-copy core gene sequences that are used
by the anvi'o taxonomy headquarters in Lausanne. Somewhat disappointing but
totally OK.

 :: anvi'o v7 dev ::  ~/Downloads/test_db >>> anvi-estimate-scg-taxonomy -c ORAL_P-D-F_Bin_00037.db
Contigs DB ...................................: ORAL_P-D-F_Bin_00037.db
Metagenome mode ..............................: False

Estimated taxonomy for "ORAL_P-D-F_Bin_00037"
===============================================
+----------------------+--------------+-------------------+------------------+
|                      |   total_scgs |   supporting_scgs | taxonomy         |
+======================+==============+===================+==================+
| ORAL_P-D-F_Bin_00037 |            0 |                 0 | /  /  /  /  /  / |
+----------------------+--------------+-------------------+------------------+
meren commented 7 months ago

(So this is not an edge case others will run into -- it is just a Frankenstein contigs-db that was updated with an earlier version of the migration script before it was merged to master, OR something else similar to that :))

mschecht commented 7 months ago

Thanks for looking into this @meren! It must have been an artifact I introduced while developing. Glad it won't impact anyone!