merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] KeyError occurred after running anvi-estimate-metabolism #2189

Closed yqy6611 closed 7 months ago

yqy6611 commented 7 months ago

Short description of the problem

"KeyError: '2 C00125'" was encountered after running anvi-estimate-metabolism.

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Profile database .............................: 39
Contigs database .............................: 22
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

System: Ubuntu 22.04
Source of anvi'o: following https://anvio.org/install/linux/dev/

Detailed description of the issue

I encountered this issue when running anvi-estimate-metabolism. Attached is the log:

(anvio-dev) e2s2@e2s2-super-server-1:/mnt/b351004a-4dc6-4bcf-add3-21fc26707773/Charmaine$ anvi-estimate-metabolism -e anvio/genome_list.txt -O anvio/SFA_metabolism --kegg-data-dir /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG

CITATION

Anvi'o will reconstruct metabolism for modules in the KEGG MODULE database, as described in Kanehisa and Goto et al (doi:10.1093/nar/gkr988). When you publish your findings, please do not forget to properly credit this work.

Metabolism data ..............................: KEGG only
External genomes file ........................: anvio/genome_list.txt

WARNING

You (or the programmer) requested genome descriptions for your internal and/or external genomes to be loaded without a 'full init'. There is nothing for you to be concerned. This is just a friendly reminder to make sure you know that if something goes terribly wrong later (like your computer sets itself on fire), this may be the reason.

Num Contigs DBs in file ......................: 314
Metagenome Mode ..............................: False

Traceback (most recent call last):
  File "/home/e2s2/Softwares/anvio/bin/anvi-estimate-metabolism", line 132, in <module>
    main(args)
  File "/home/e2s2/Softwares/anvio/anvio/terminal.py", line 915, in wrapper
    program_method(*args, **kwargs)
  File "/home/e2s2/Softwares/anvio/bin/anvi-estimate-metabolism", line 39, in main
    m.estimate_metabolism()
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 6346, in estimate_metabolism
    self.init_data_from_modules_db()
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 2518, in init_data_from_modules_db
    module_substrate_list, module_intermediate_list, module_product_list = self.kegg_modules_db.get_human_readable_compound_lists_for_module(mod)
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 7631, in get_human_readable_compound_lists_for_module
    substrate_name_list = [compound_to_name_dict[c] for c in substrate_compounds]
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 7631, in <listcomp>
    substrate_name_list = [compound_to_name_dict[c] for c in substrate_compounds]
KeyError: '2 C00125'

ivagljiva commented 7 months ago

Hi @yqy6611, thanks for the report. It looks like something is wrong with your KEGG database. I wonder if some formatting has changed on KEGG's end that is breaking our parsing functions. Could you please tell me the following:

1) What parameters did you use to download the KEGG data (i.e., what did your anvi-setup-kegg-data command look like)?
2) What is the hash of your modules database? You can find this by running anvi-db-info /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG

Thanks!

yqy6611 commented 7 months ago

Hi @ivagljiva . Thanks for your super quick response!

The command looks like this: anvi-setup-kegg-data --kegg-data-dir /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG --download-from-kegg. No error occurred.

And the hash is 9d23e527fc1f. Attached is the log:

(anvio-dev) e2s2@e2s2-super-server-1:~$ anvi-db-info '/mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG/MODULES.db'

DB Info (no touch)

Database Path ................................: /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG/MODULES.db
db_type ......................................: modules (variant: None)
version ......................................: 4

DB Info (no touch also)

data_source ..................................: KEGG
annotation_sources ...........................: KOfam
num_modules ..................................: 479
total_module_entries .........................: 15204
is_brite_setup ...............................: 1
num_brite_hierarchies ........................: 57
total_brite_entries ..........................: 55266
creation_date ................................: 1702445076.06138
hash .........................................: 9d23e527fc1f

ivagljiva commented 7 months ago

Thanks @yqy6611, it looks like a version of KEGG I haven't downloaded before, so possibly the change on their end was fairly recent. I'll see if I can download the same one and test it on my computer, then get back to you :)

ivagljiva commented 7 months ago

I managed to download the same version of KEGG data, and I've found the problem :) It is indeed due to a change in the way KEGG stores data, specifically for chemical reactions (the 'REACTION' data type in KEGG MODULE files). They used to store only compound IDs for each reaction without considering stoichiometry, like this example taken from a KEGG snapshot with hash a2b5bde358bb (the default for anvi'o v8):

R02164    C00399 + C00042 -> C00390 + C00122

But now, these reaction strings can include stoichiometry. For instance, here is a reaction from the current version of KEGG data:

 R02161    C00390 + 2 C00125 -> C00399 + 2 C00126 + 2 C00080

Currently, when we split these REACTION strings, we don't expect to find numbers between the compound IDs, so the numbers end up included with the IDs. That is what leads to errors like KeyError: '2 C00125'; the 2 should not be in there.
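To make the failure concrete, here is a minimal sketch of how a split on '+' alone produces the bad key. This is not the actual code in kegg.py; the function name `naive_substrates` and the lookup dictionary (with placeholder compound names) are hypothetical and exist only for illustration.

```python
# Hypothetical stand-in for the compound ID -> name lookup.
# The names are placeholders, not real KEGG compound names.
compound_to_name = {
    "C00390": "placeholder name for C00390",
    "C00125": "placeholder name for C00125",
}

def naive_substrates(reaction_line):
    """Split a REACTION line on '+' only, as if stoichiometry never appeared."""
    # Drop the leading reaction ID, then keep the left-hand side of '->'.
    _, equation = reaction_line.split(None, 1)
    left, _ = equation.split("->")
    return [term.strip() for term in left.split("+")]

terms = naive_substrates("R02161    C00390 + 2 C00125 -> C00399 + 2 C00126 + 2 C00080")

# The coefficient stays glued to the compound ID, so the lookup misses.
missing = [t for t in terms if t not in compound_to_name]
print(terms)
print(missing)
```

The second substrate comes back as '2 C00125' rather than 'C00125', which is exactly the key reported in the traceback.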

I think I should be able to fix it by splitting each compound sub-string further and keeping only the last element. Working on it now.
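A rough sketch of that idea, assuming each term is either a bare compound ID or a coefficient followed by an ID, and that the two sides are separated by '->'. The helper names `compound_ids` and `parse_reaction` are made up for this example and are not the functions changed in the actual commit.

```python
def compound_ids(side):
    """Return the compound IDs on one side of a reaction, dropping coefficients."""
    ids = []
    for term in side.split("+"):
        parts = term.split()       # e.g. ['2', 'C00125'] or ['C00390']
        if parts:
            ids.append(parts[-1])  # the ID is always the last element
    return ids

def parse_reaction(reaction_line):
    """Split a REACTION line into (reaction ID, substrate IDs, product IDs)."""
    reaction_id, equation = reaction_line.split(None, 1)
    left, right = equation.split("->")
    return reaction_id, compound_ids(left), compound_ids(right)

# New-style line with stoichiometry.
new_style = parse_reaction("R02161    C00390 + 2 C00125 -> C00399 + 2 C00126 + 2 C00080")
# Old-style line without stoichiometry still parses the same way.
old_style = parse_reaction("R02164    C00399 + C00042 -> C00390 + C00122")
print(new_style)
print(old_style)
```

Taking the last whitespace-separated element works for both formats, so the same code path handles older KEGG snapshots and the newer ones that include coefficients.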

ivagljiva commented 7 months ago

@yqy6611 , I fixed this problem with commit https://github.com/merenlab/anvio/commit/032a27bd426fc0aa75e214bbd97534d99523bc9c

If you update your anvio-dev codebase with a git pull and re-try your anvi-estimate-metabolism command, it should work now :)

yqy6611 commented 7 months ago

Thanks @ivagljiva ! It worked. Then should anvi-reaction-network be revised as well?

ivagljiva commented 7 months ago

anvi-reaction-network is part of a separate suite of programs and doesn't utilize the MODULES database, so it won't suffer from this problem :)

ivagljiva commented 7 months ago

(At least, I think it won't. But it is possible that this change to the KEGG data format could cause issues elsewhere in the code. I would just try running anvi-reaction-network and see. If you get errors, you can open another issue.)