merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Inconsistencies between KEGG Mapper Reconstruction Result vs Anvio generated KEGG completion values. #1863

Closed JSSaini closed 2 years ago

JSSaini commented 2 years ago

Hi, I am analysing the completion values of pathways from a prospective microbial eukaryotic genome. I have used two strategies to check the completion of pathways: 1) KEGG Mapper Reconstruction Result and 2) anvi'o-generated KEGG pathway completion values.

1) KEGG Mapper Reconstruction Result: This takes the GhostKOALA output as input and reports the completeness of the associated pathways and modules based on K numbers (KO identifiers: the letter K followed by five digits).
According to the KEGG Mapper Reconstruction Result, two of the modules (M00169, M00172) are reported as complete. Please see the attached example results, in which the two modules are highlighted in red.

image

I have also attached user_ko.txt so that the same results can be visualized with KEGG Mapper Reconstruct.

user_ko.txt

https://onlinelibrary.wiley.com/doi/10.1002/pro.4172

2) Anvi'o-based results for KEGG pathways: Looking at the same modules/pathways, the anvi'o-generated output (kegg_out_f2_modules_custom.txt, attached) suggests that these pathways are incomplete.

Please have a look. Thank you.

Kind regards, Jaspreet

meren commented 2 years ago

This issue is closed with the "missing info / invalid report" label since it doesn't follow the New Issue template.

ivagljiva commented 2 years ago

I just have one comment -

A possible reason for this inconsistency is the use of a different set of KEGG Orthologs and Modules. Anvi'o v7 uses a snapshot of KEGG from December 2020. As KEGG is frequently updated, it could be that new modules and/or KOs have been added in the meantime and the additional complete pathways that you see from GhostKoala/Mapper are coming from the up-to-date databases. If you wanted to use the most up-to-date version of the KEGG database with anvi'o, you could run anvi-setup-kegg-kofams with the -D flag.

You did not provide details on how you generated the anvi'o output files, but there could also be differences resulting from the set of parameters you chose.
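To illustrate why a different set of KO assignments can flip a module from complete to incomplete, here is a minimal sketch of a step-based completeness score. It is a simplified illustration, not the algorithm used by anvi'o or KEGG Mapper: the function name `module_completeness`, the toy module, and the `KO_*` labels are all hypothetical placeholders, and real KEGG module definitions are more complex than a flat list of steps.

```python
from typing import Iterable, List, Set


def module_completeness(steps: List[Set[str]], observed_kos: Iterable[str]) -> float:
    """Fraction of steps for which at least one alternative KO was observed.

    Simplified assumption: a module is a flat list of steps, and each step
    is a set of interchangeable KOs.
    """
    observed = set(observed_kos)
    if not steps:
        return 0.0
    satisfied = sum(1 for alternatives in steps if alternatives & observed)
    return satisfied / len(steps)


# Hypothetical four-step module; step 2 can be satisfied by either of two KOs.
toy_module = [{"KO_A"}, {"KO_B", "KO_C"}, {"KO_D"}, {"KO_E"}]

# Two hypothetical annotation runs that disagree on a single KO.
tool_one_hits = {"KO_A", "KO_B", "KO_D", "KO_E"}
tool_two_hits = {"KO_A", "KO_B", "KO_D"}

score_one = module_completeness(toy_module, tool_one_hits)
score_two = module_completeness(toy_module, tool_two_hits)
print(score_one, score_two)
```

In this toy case a single missing KO is enough to change the score, so whether a module is called "complete" can depend on both the database snapshot and any completeness threshold that is applied.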

JSSaini commented 2 years ago

Hi, Thanks for your prompt answers.

I first used anvi'o to export the protein sequences, which I then submitted to GhostKOALA to obtain (1), the KEGG Mapper Reconstruction Result:

anvi-get-sequences-for-gene-calls -c CONTIGS.db --get-aa-sequences -o protein-sequences.fa

The second output (2) was generated with anvi-estimate-metabolism as follows:

anvi-estimate-metabolism -c contigs_197.db -p ./All_SAMPLES-MERGED_P/PROFILE.db --kegg-output-modes modules,kofam_hits_in_modules,kofam_hits --add-coverage -O kegg_out_f

The use of different databases is most likely the reason.

xvazquezc commented 2 years ago

Hi, I just wanted to point out that anvi'o uses HMM profiles for the KO assignment, in a way more similar to KofamScan/KofamKOALA. GhostKOALA instead uses a homology search with GHOSTX, an accelerated BLASTX-like algorithm. So I would expect some differences, especially with borderline hits.
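One way to check how much the two annotation approaches disagree is to compare the sets of KOs each one assigned. The sketch below assumes a GhostKOALA-style user_ko.txt (tab-separated gene ID and K number, with the K number missing for unannotated genes) and a tab-separated table with a header that contains a KO column. The column name `ko`, the helper names, and the sample IDs are assumptions for illustration, so adjust them to match the actual files.

```python
import csv
import io
from typing import Dict, Set


def read_ghostkoala_kos(text: str) -> Set[str]:
    """Collect KOs from 'gene<TAB>KO' lines, skipping genes without a KO."""
    kos = set()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].strip():
            kos.add(fields[1].strip())
    return kos


def read_kos_from_tsv(text: str, ko_column: str = "ko") -> Set[str]:
    """Collect KOs from a headered TSV; the column name is an assumption."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[ko_column].strip() for row in reader if row.get(ko_column)}


def compare_ko_sets(first: Set[str], second: Set[str]) -> Dict[str, Set[str]]:
    """Split two KO sets into shared and tool-specific assignments."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


# Hypothetical sample data; real files would be read with open(...).read().
ghostkoala_text = "gene_1\tKO_A\ngene_2\ngene_3\tKO_B\ngene_4\tKO_C\n"
hmm_table_text = "gene_caller_id\tko\n1\tKO_A\n3\tKO_B\n5\tKO_D\n"

result = compare_ko_sets(
    read_ghostkoala_kos(ghostkoala_text),
    read_kos_from_tsv(hmm_table_text),
)
for label, kos in result.items():
    print(label, sorted(kos))
```

KOs that appear in only one of the two sets are the first place to look when a module is complete in one result and incomplete in the other.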