merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

[BUG] Key Error when running anvi-metabolism #2238

Closed lexikazen closed 3 months ago

lexikazen commented 3 months ago

Short description of the problem

I keep getting KeyError: 1766176 when trying to run anvi-estimate metabolism, and I'm not really sure what's going on.

anvi'o version

Anvi'o .......................................: marie (v8) Python .......................................: 3.10.13

Profile database .............................: 38 Contigs database .............................: 21 Pan database .................................: 16 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 4 tRNA-seq database ............................: 2

anvi-self-test --version

System info

Anvio was installed through our institution's RCC.

Detailed description of the issue

I'm trying to generate a kegg-metabolism-modules.txt file so I can run anvi-compute-metabolic-enrichment, but I end up with this output when I try to run anvi-estimate-metabolism:

========================================== SLURM_JOB_ID = 2236159 SLURM_NODELIST = hm01

WARNING

ProfileSuperClass found a collection focus, which means it will be initialized using only the splits in the profile database that are affiliated with the collection METABAT and all bins it describes.

Auxiliary Data ...............................: Found: AUXILIARY-DATA.db (v. 2) Profile Super ................................: Initialized with 119182 of 268661 splits: PROFILE_Umfleet.db (v. 38) Contigs DB ...................................: contigs_Umfleet_anvio8.db Profile DB ...................................: PROFILE_Umfleet.db Collection ...................................: METABAT Metagenome mode ..............................: True

CITATION

Anvi'o will reconstruct metabolism for modules in the KEGG MODULE database, as described in Kanehisa and Goto et al (doi:10.1093/nar/gkr988). When you publish your findings, please do not forget to properly credit this work.

Modules database .............................: An existing database, /hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/data/misc/KEGG/MODULES.db, has been loaded. Modules ......................................: 479 found BRITE KO hierarchies .........................: 57 found Metabolism data ..............................: KEGG only Output file for modules mode .................: kegg-metabolism_modules.txt Annotation sources used ......................: KOfam Gene calls from these sources ................: 1031010 found

WARNING

A subset of splits (119182 of 268661, to be precise) are requested to initiate gene-level coverage stats for. No need to worry, this is just a warning in case you are as obsessed as wanting to know everything there is to know.

✖ anvi-estimate-metabolism encountered an error after 1:41:08.694390 Traceback (most recent call last): File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/bin/anvi-estimate-metabolism", line 123, in main(args) File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/terminal.py", line 915, in wrapper program_method(*args, **kwargs) File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/bin/anvi-estimate-metabolism", line 39, in main m.estimate_metabolism() File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/kegg.py", line 5029, in estimate_metabolism kegg_metabolism_superdict, kofam_hits_superdict = self.estimate_for_contigs_db_for_metagenome(kofam_hits_info) File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/kegg.py", line 4732, in estimate_for_contigs_db_for_metagenome metagenome_metabolism_superdict[contig] = self.estimate_for_list_of_splits(metabolism_dict_for_contig, bin_name=contig) File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/kegg.py", line 4151, in estimate_for_list_of_splits self.add_module_coverage(mod, metabolism_dict_for_list_of_splits) File "/hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/kegg.py", line 4088, in add_module_coverage cov = self.profile_db.gene_level_coverage_stats_dict[g][s]['mean_coverage'] KeyError: 1766176

Files / commands to reproduce the issue

I used this command to run the input:

anvi-estimate-metabolism -c contigs_Umfleet_anvio8.db --output-modes modules --metagenome-mode --add-coverage -p PROFILE_Umfleet.db -C METABAT --kegg-data-dir /hpc/apps/miniconda3/4.9.2/envs/anvio-8.0.0/lib/python3.10/site-packages/anvio/data/misc/KEGG

ivagljiva commented 3 months ago

Hey @lexikazen , the bad news is that this looks like a bug with the coverage code. It would help me to fix it if you would be willing to send over some data so that I can reproduce the issue on my computer. But I'll take a look at the code regardless and see if I can figure out what is wrong.

The good news is that anvi-compute-metabolic-enrichment doesn't need the coverage data (and even if its there, it won't do anything with it anyway), so you can simply re-run anvi-estimate-metabolism without the --add-coverage flag, and you would get the output you need for the enrichment test.

meren commented 3 months ago

I am not sure why this is happening from the output alone, but the only way there is a gene caller id, g (recognized in contigs-db), that is missing in self.profile_db.gene_level_coverage_stats_dict, is that the coverages were recovered for genes known to profile-db (i.e., because there was a cutoff that discarded some contigs from profiling or there was a bin that only focused on a subset of genes in the contigs-db) and somewhere in the upstream someone doesn't know about it. Constraining the gene calls to the known universe of genes by the profile-db are handled internally when the profile-db is initialized with a collection/bin, but maybe there is a bug somewhere when it is used through anvi-estimate-metabolism. Just mentioning these as self notes :)

We can probably reproduce this error by creating a collection for any metagenome bins in which do not include all contigs in the contigs-db.

PS: over 1 million genes -- that's a nice dataset, @lexikazen :)

lexikazen commented 3 months ago

@ivagljiva I tried to attach my profile db and contigs db, but Github won't let me upload them. How should I share those files with you?

ivagljiva commented 3 months ago

Actually, it is okay @lexikazen :) I managed to find a dataset that reproduces the error. It works for me when specifying a collection without metagenome mode, but breaks when combining a collection with metagenome mode. I guess that is a test I forgot to do when developing :p I had assumed that people with collections would want to treat each bin in the collection as an individual genome rather than using the collection to split out a subset of contigs to estimate on individually, but clearly there is a need for the latter. So thank you very much for finding this edge case! I will use my test data to fix the bug :)

lexikazen commented 3 months ago

Actually, it is okay @lexikazen :) I managed to find a dataset that reproduces the error. It works for me when specifying a collection without metagenome mode, but breaks when combining a collection with metagenome mode. I guess that is a test I forgot to do when developing :p I had assumed that people with collections would want to treat each bin in the collection as an individual genome rather than using the collection to split out a subset of contigs to estimate on individually, but clearly there is a need for the latter. So thank you very much for finding this edge case! I will use my test data to fix the bug :)

Great thank you! :)

ivagljiva commented 3 months ago

I managed to fix the bug :) Turns out we were loading gene calls from all splits in the DBs even when a collection name was passed, and this conflicted downstream when the gene coverages were loaded just for the collection.

The PR #2242 addresses the issue in anvio-dev. @lexikazen , if you want, you could install the development branch and try your command again in that environment, and it should work :)

meren commented 3 months ago

Thank you very much, Iva! :)