Closed yqy6611 closed 11 months ago
Hi @yqy6611 , thanks for the report. It looks like something is wrong with your KEGG database. I wonder if some formatting has changed on KEGG's end that is breaking our parsing functions. Could you please tell me the following:
1) what parameters did you use to download the KEGG data (ie, what did your anvi-setup-kegg-data command look like)?
2) What is the hash of your modules database? you can find this by running anvi-db-info /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG
Thanks!
Hi @ivagljiva . Thanks for your super quick response!
The command looks like this: anvi-setup-kegg-data --kegg-data-dir /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG --download-from-kegg. No error occurred.
And the hash is 9d23e527fc1f. Attached is the log:
(anvio-dev) e2s2@e2s2-super-server-1:~$ anvi-db-info '/mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG/MODULES.db'
Database Path ................................: /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG/MODULES.db
db_type ......................................: modules (variant: None)
version ......................................: 4
data_source ..................................: KEGG
annotation_sources ...........................: KOfam
num_modules ..................................: 479
total_module_entries .........................: 15204
is_brite_setup ...............................: 1
num_brite_hierarchies ........................: 57
total_brite_entries ..........................: 55266
creation_date ................................: 1702445076.06138
hash .........................................: 9d23e527fc1f
Thanks @yqy6611, it looks like a version of KEGG I haven't downloaded before, so possibly the change on their end was fairly recent. I'll see if I can download the same one and test it on my computer, then get back to you :)
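As a quick sanity check on "fairly recent": the creation_date value in the anvi-db-info output looks like a Unix timestamp. Assuming it is seconds since the epoch (an assumption, the thread does not say so explicitly), a small sketch like this converts it to a calendar date:

```python
from datetime import datetime, timezone

# creation_date as reported by anvi-db-info in this thread.
# Assumption: the value is seconds since the Unix epoch.
creation_date = 1702445076.06138

created = datetime.fromtimestamp(creation_date, tz=timezone.utc)
print(created.date().isoformat())  # 2023-12-13
```

Under that assumption, the database was built in mid-December 2023.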
I managed to download the same version of KEGG data, and I've found the problem :) It is indeed due to a change in the way KEGG stores data, specifically for chemical reactions (the 'REACTION' data type in KEGG MODULE files). They used to store only compound IDs for each reaction without considering stoichiometry, like this example taken from a KEGG snapshot with hash a2b5bde358bb (the default for anvi'o v8):
R02164 C00399 + C00042 -> C00390 + C00122
But now, these reaction strings can include stoichiometry. For instance, here is a reaction from the current version of KEGG data:
R02161 C00390 + 2 C00125 -> C00399 + 2 C00126 + 2 C00080
Currently, when we split these REACTION strings, we don't expect to find numbers in between the compound IDs, so the numbers are getting included with the IDs (which is what leads to errors like this: KeyError: '2 C00125'. The 2 should not be in there).
I think I should be able to fix it by splitting each compound sub-string further and keeping only the last element. Working on it now.
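The idea of splitting each compound sub-string further and keeping only the last element could look roughly like the sketch below. This is not the code from the actual commit; parse_reaction_line is a hypothetical helper, and it assumes each REACTION line is a reaction ID followed by an equation that uses "->" and "+" as separators, as in the two examples above.

```python
def parse_reaction_line(line):
    """Hypothetical helper (not anvi'o's function): split a REACTION line
    into (reaction_id, substrate_ids, product_ids)."""
    # Assumption: the first whitespace-separated token is the reaction ID.
    reaction_id, equation = line.split(None, 1)
    left, _, right = equation.partition("->")

    def compound_ids(side):
        # A term is either "C00125" or "2 C00125"; keeping only the last
        # token drops any leading stoichiometric coefficient.
        return [term.split()[-1] for term in side.split("+")]

    return reaction_id, compound_ids(left), compound_ids(right)


old_style = "R02164 C00399 + C00042 -> C00390 + C00122"
new_style = "R02161 C00390 + 2 C00125 -> C00399 + 2 C00126 + 2 C00080"

print(parse_reaction_line(old_style))
print(parse_reaction_line(new_style))
```

Both the old and the new format go through the same code path, since taking the last token of a term with no coefficient just returns the compound ID itself.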
@yqy6611 , I fixed this problem with commit https://github.com/merenlab/anvio/commit/032a27bd426fc0aa75e214bbd97534d99523bc9c
If you update your anvio-dev codebase with a git pull and re-try your anvi-estimate-metabolism command, it should work now :)
Thanks @ivagljiva ! It worked. Then should anvi-reaction-network be revised as well?
anvi-reaction-network is part of a separate suite of programs and doesn't utilize the MODULES database, so it won't suffer from this problem :)
(at least, I think it won't. But it is possible that this change to the KEGG data format could cause issues elsewhere in the code. I would just try running anvi-reaction-network and see. If you get errors, you can open another issue)
Short description of the problem
"KeyError: '2 C00125'" was encountered after running anvi-estimate-metabolism.
anvi'o version
Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13
Profile database .............................: 39
Contigs database .............................: 22
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2
System info
System: Ubuntu 22.04
Source of anvi'o: following https://anvio.org/install/linux/dev/
Detailed description of the issue
I encountered this issue when running anvi-estimate-metabolism. Attached is the log:
(anvio-dev) e2s2@e2s2-super-server-1:/mnt/b351004a-4dc6-4bcf-add3-21fc26707773/Charmaine$ anvi-estimate-metabolism -e anvio/genome_list.txt -O anvio/SFA_metabolism --kegg-data-dir /mnt/cc061462-beb3-49ab-9318-5c7bf54d8588/database/anvio-KEGG
CITATION
Anvi'o will reconstruct metabolism for modules in the KEGG MODULE database, as described in Kanehisa and Goto et al (doi:10.1093/nar/gkr988). When you publish your findings, please do not forget to properly credit this work.
Metabolism data ..............................: KEGG only
External genomes file ........................: anvio/genome_list.txt
WARNING
You (or the programmer) requested genome descriptions for your internal and/or external genomes to be loaded without a 'full init'. There is nothing for you to be concerned. This is just a friendly reminder to make sure you know that if something goes terribly wrong later (like your computer sets itself on fire), this may be the reason.
Num Contigs DBs in file ......................: 314
Metagenome Mode ..............................: False
Traceback (most recent call last):
  File "/home/e2s2/Softwares/anvio/bin/anvi-estimate-metabolism", line 132, in <module>
    main(args)
  File "/home/e2s2/Softwares/anvio/anvio/terminal.py", line 915, in wrapper
    program_method(*args, **kwargs)
  File "/home/e2s2/Softwares/anvio/bin/anvi-estimate-metabolism", line 39, in main
    m.estimate_metabolism()
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 6346, in estimate_metabolism
    self.init_data_from_modules_db()
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 2518, in init_data_from_modules_db
    module_substrate_list, module_intermediate_list, module_product_list = self.kegg_modules_db.get_human_readable_compound_lists_for_module(mod)
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 7631, in get_human_readable_compound_lists_for_module
    substrate_name_list = [compound_to_name_dict[c] for c in substrate_compounds]
  File "/home/e2s2/Softwares/anvio/anvio/kegg.py", line 7631, in <listcomp>
    substrate_name_list = [compound_to_name_dict[c] for c in substrate_compounds]
KeyError: '2 C00125'
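For anyone who wants to see the failure in isolation, here is a minimal sketch that reproduces the same KeyError outside of anvi'o. The dictionary is a tiny illustrative stand-in for the real compound-to-name mapping (the compound names are filled in for the demo), and the plain split on " + " is only an assumption about how a coefficient can end up glued to an ID; anvi'o's real parsing code may differ.

```python
# Illustrative stand-in for the compound-ID-to-name mapping (assumed entries).
compound_to_name_dict = {
    "C00390": "Ubiquinol",
    "C00125": "Ferricytochrome c",
}

# Substrate side of the new-style reaction R02161 quoted in this thread.
substrate_side = "C00390 + 2 C00125"

# Assumption: a naive split that does not expect stoichiometric coefficients.
substrate_compounds = substrate_side.split(" + ")

failing_key = None
try:
    substrate_name_list = [compound_to_name_dict[c] for c in substrate_compounds]
except KeyError as err:
    # The coefficient stayed attached to the ID, so the lookup key is wrong.
    failing_key = err.args[0]
    print(f"KeyError: {failing_key!r}")
```

The lookup fails on the second element because the key carries the coefficient along with the compound ID.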