merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] incompatibility between modules db and contigs db in metabolism self-test #2128

Closed semiller10 closed 11 months ago

semiller10 commented 11 months ago

Short description of the problem

The test genomes used for metabolism estimation in anvi-self-test --suite metabolism have contigs databases whose KEGG annotations come from a modules database that is out of date relative to the current KEGG snapshot.

anvi'o version

Anvi'o .......................................: hope (v7.1-dev)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

Detailed description of the issue

An error is thrown because the modules db hash stored in the test genome contigs dbs does not match the hash of the new modules db that the test run builds from the latest KEGG snapshot. The obvious solution is to update the KEGG annotations of the anvi'o sandbox contigs dbs, but perhaps a better solution would be to reannotate the copies of the contigs dbs used in the test script by running anvi-run-kegg-kofams in the script itself. This would prevent future discrepancies between modules dbs built in the script from an up-to-date KEGG snapshot and out-of-date contigs db annotations.
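To make the failure mode concrete, here is a minimal sketch of a hash consistency check of this kind. It is not anvi'o's actual code: the function names, the hashing scheme, and the toy module definitions are all assumptions made up for illustration. The only point is that a hash recorded at annotation time stops matching once the modules data changes.

```python
import hashlib


def modules_hash(module_definitions):
    """Hash a {module id: definition} mapping in a stable, sorted order.

    Hypothetical stand-in for however a modules db derives its hash.
    """
    digest = hashlib.sha256()
    for module_id in sorted(module_definitions):
        digest.update(module_id.encode())
        digest.update(b"\t")
        digest.update(module_definitions[module_id].encode())
        digest.update(b"\n")
    return digest.hexdigest()[:12]


def annotations_are_current(contigs_db_hash, module_definitions):
    """True if the hash stored with the annotations matches the modules data."""
    return contigs_db_hash == modules_hash(module_definitions)


# Toy data (made up): the same module before and after a snapshot update.
old_snapshot = {"module_A": "KO_1 KO_2"}
new_snapshot = {"module_A": "KO_1 KO_2 KO_3"}

# The hash recorded in the contigs db when it was annotated against the old snapshot.
stored_hash = modules_hash(old_snapshot)

print(annotations_are_current(stored_hash, old_snapshot))
print(annotations_are_current(stored_hash, new_snapshot))
```

Under these assumptions, the first check passes and the second fails, which is the situation the self-test runs into when it builds a fresh modules db but reuses previously annotated contigs dbs.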

ivagljiva commented 11 months ago

Thanks, again, for catching this @semiller10. I took your advice and implemented the reannotation of all databases in the self test. Since it can take a really long time, I also added multi-threading capacity to the anvi-self-test program to speed things along a bit. But setting up KEGG data also takes so much time at this point (19 minutes in my current run of anvi-self-test --suite metabolism) that the extra overhead from reannotation doesn't really matter anyway, I guess :)
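For anyone who wants to do the same reannotation outside the self-test, a rough sketch of the idea is below. It is not the code from the commits mentioned in this thread. The helper name, the database file names, and the KEGG directory are hypothetical, and the flags used (-c, --kegg-data-dir, -T, --just-do-it) are written from memory of the anvi-run-kegg-kofams help output, so please check them against your installed version before relying on this.

```python
import shlex


def reannotation_commands(contigs_dbs, kegg_data_dir, num_threads=1):
    """Build one anvi-run-kegg-kofams command per contigs db (hypothetical helper).

    Only builds argument lists; nothing is executed here.
    """
    commands = []
    for db_path in contigs_dbs:
        commands.append([
            "anvi-run-kegg-kofams",
            "-c", db_path,                    # contigs db to reannotate
            "--kegg-data-dir", kegg_data_dir,  # KEGG data set up for this test run
            "-T", str(num_threads),           # threads to use for the annotation
            "--just-do-it",                   # assumed to allow overwriting old annotations
        ])
    return commands


# Placeholder file names, not the real sandbox paths.
for cmd in reannotation_commands(["genome_1.db", "genome_2.db"], "KEGG_DATA", num_threads=4):
    print(shlex.join(cmd))
```

Each argument list can be passed to subprocess.run(cmd, check=True) to actually run it. Building the commands separately from running them makes it easy to inspect what would be executed first.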

See commits f26a21e through 8fec134.