raw-lab / MetaCerberus

Python code for versatile Functional Ontology Assignments for Metagenomes searching via Hidden Markov Model (HMM) with environmental focus of shotgun metaomics data
BSD 3-Clause "New" or "Revised" License

Potential issues with VOG annotations and questions on the decision tree #26

Closed mdhishamshaikh closed 1 month ago

mdhishamshaikh commented 1 month ago

Hello!

First of all, MetaCerberus is a great and very convenient tool for scanning multiple databases for functional annotation! Kudos to you all! I've been using MetaCerberus to annotate geNomad-identified viral proteins from a metagenomic survey.

I have a few questions regarding the decision tree. We get hits from multiple databases, and after following the decision tree you assign one best hit per target in the final output file. It is entirely possible that this best hit does not have a proper description. For example, GVDB might have the best hit but its annotation could be "no annotation", while KEGG or other runner-up hits might be able to assign an appropriate description. So, instead of taking the best hit, I would take the best hit with an appropriate description. I could of course do the same with the top5 files, just to make sure I am not missing something. Would this be a valid approach? I also plan to identify non-descriptive terms per database to try to automate this a bit.
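The selection described above could be sketched roughly as follows. This is only an illustration in Python: the hit records, their field names (`database`, `description`, `score`), the scores, and the `NON_DESCRIPTIVE` term list are all assumptions made for the example, not MetaCerberus's actual output format or decision tree.

```python
# Hypothetical placeholder terms; a real list would be curated per database.
NON_DESCRIPTIVE = {"", "hypothetical", "hypothetical protein", "no annotation", "na"}


def is_descriptive(description):
    """Return True if the description is not a known placeholder term."""
    return description.strip().lower() not in NON_DESCRIPTIVE


def best_descriptive_hit(hits):
    """Pick the highest-scoring hit whose description is informative.

    Falls back to the overall best hit when no hit is descriptive,
    and returns None when there are no hits at all.
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    for hit in ranked:
        if is_descriptive(hit["description"]):
            return hit
    return ranked[0]


# Made-up hits for one target; the scores are illustrative only.
example_hits = [
    {"database": "GVDB", "description": "no annotation", "score": 310.0},
    {"database": "KEGG", "description": "DNA polymerase I", "score": 295.5},
    {"database": "PFAM", "description": "Hypothetical", "score": 120.0},
]

chosen = best_descriptive_hit(example_hits)
print(chosen["database"], chosen["description"])
```

Falling back to the overall best hit keeps every target in the result, so targets with only placeholder annotations can still be reviewed later.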

Secondly, in an attempt to find second- or third-best hits, I concatenated the outputs of all the database hits into a single file. More often than not, the VOG hits differed from the rest of the database hits.


> annotations %>% dplyr::filter(target == "k141_212130_7")
          target AMRFinder_product CAZy_product                                                COG_product GVDB_product
          <char>            <char>       <char>                                                     <char>       <char>
1: k141_212130_7      Hypothetical Hypothetical DNA polymerase I, 3'-5' exonuclease and polymerase domains Hypothetical
   KOFam_all_FOAM_product KOFam_all_KEGG_product PFAM_product     PGAP_product  PHROG_product   PVOG_product TIGRFAM_product
                   <char>                 <char>       <char>           <char>         <char>         <char>          <char>
1: polA; DNA polymerase I       DNA polymerase I Hypothetical DNA polymerase I DNA polymerase DNA polymerase    Hypothetical
                                                             VOG_product
                                                                  <char>
1: Zinc finger A20 and AN1 domain-containing stress-associated protein 9
> annotations %>% dplyr::filter(target == "k141_40613_4")
         target AMRFinder_product CAZy_product                                                COG_product GVDB_product
         <char>            <char>       <char>                                                     <char>       <char>
1: k141_40613_4      Hypothetical Hypothetical DNA polymerase I, 3'-5' exonuclease and polymerase domains Hypothetical
   KOFam_all_FOAM_product KOFam_all_KEGG_product            PFAM_product     PGAP_product  PHROG_product   PVOG_product
                   <char>                 <char>                  <char>           <char>         <char>         <char>
1: polA; DNA polymerase I       DNA polymerase I DNA polymerase family A DNA polymerase I DNA polymerase DNA polymerase
    TIGRFAM_product                                                           VOG_product
             <char>                                                                <char>
1: DNA polymerase I Zinc finger A20 and AN1 domain-containing stress-associated protein 9
> annotations %>% dplyr::filter(target == "k141_284757_11")
           target AMRFinder_product CAZy_product                       COG_product                                      GVDB_product
           <char>            <char>       <char>                            <char>                                            <char>
1: k141_284757_11      Hypothetical Hypothetical Superfamily I DNA or RNA helicase putative helicase/exonuclease | UvrD/REP helicase
   KOFam_all_FOAM_product                            KOFam_all_KEGG_product                        PFAM_product PGAP_product
                   <char>                                            <char>                              <char>       <char>
1:           Hypothetical DNA helicase II / ATP-dependent DNA helicase PcrA UvrD/REP helicase N-terminal domain Hypothetical
   PHROG_product PVOG_product TIGRFAM_product               VOG_product
          <char>       <char>          <char>                    <char>
1:  DNA helicase DNA helicase    Hypothetical virion structural protein
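One way to flag rows like these automatically is a crude keyword-overlap check, sketched below in Python. The column names and values are copied from the first row above; the placeholder list, the stop-word list, and the rule "VOG shares no keyword with any other informative database" are assumptions chosen for illustration, not part of MetaCerberus.

```python
import re

# Assumed placeholder terms and stop words; adjust to your own data.
PLACEHOLDERS = {"", "hypothetical", "no annotation"}
STOPWORDS = {"and", "or", "of", "the", "protein", "domain", "domains"}


def is_placeholder(description):
    """True for empty or non-descriptive annotations."""
    return description.strip().lower() in PLACEHOLDERS


def keywords(description):
    """Lower-case alphanumeric tokens of a description, minus stop words."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {t for t in tokens if t not in STOPWORDS}


def vog_disagrees(row, vog_column="VOG_product"):
    """True if the VOG description shares no keyword with any other
    informative *_product column in the same row."""
    vog = row.get(vog_column, "")
    if is_placeholder(vog):
        return False
    others = [
        value
        for column, value in row.items()
        if column.endswith("_product")
        and column != vog_column
        and not is_placeholder(value)
    ]
    if not others:
        return False
    vog_words = keywords(vog)
    return all(vog_words.isdisjoint(keywords(other)) for other in others)


# First row from the output above (target k141_212130_7).
row = {
    "target": "k141_212130_7",
    "AMRFinder_product": "Hypothetical",
    "CAZy_product": "Hypothetical",
    "COG_product": "DNA polymerase I, 3'-5' exonuclease and polymerase domains",
    "GVDB_product": "Hypothetical",
    "KOFam_all_FOAM_product": "polA; DNA polymerase I",
    "KOFam_all_KEGG_product": "DNA polymerase I",
    "PFAM_product": "Hypothetical",
    "PGAP_product": "DNA polymerase I",
    "PHROG_product": "DNA polymerase",
    "PVOG_product": "DNA polymerase",
    "TIGRFAM_product": "Hypothetical",
    "VOG_product": "Zinc finger A20 and AN1 domain-containing stress-associated protein 9",
}

print(vog_disagrees(row))
```

Keyword overlap is deliberately simple and will miss synonyms, so flagged rows are candidates for manual review rather than confirmed errors.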

There are plenty of VOG hits that are considered best and are in the final file. At this point, I am concerned that there is some error in connecting

I checked the VOGDB annotation summary TSV from their website, and the descriptions for the same IDs do not match between MetaCerberus and VOGDB.

| Target | MetaCerberus_hit | MetaCerberus_description | VOGDB_description |
| --- | --- | --- | --- |
| k141_212130_7 | VOG01301 | Zinc finger A20 and AN1 domain-containing stress-associated protein 9 | REFSEQ hypothetical protein |
| k141_284757_11 | VOG03885 | virion structural protein | REFSEQ hypothetical protein |
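A comparison like this can be scripted once both ID-to-description mappings are loaded. The Python sketch below uses two small dictionaries holding the values from the rows above; how the mappings are read from the MetaCerberus metadata and the VOGDB annotation file is left out, because their exact layouts are not assumed here. The function and variable names are hypothetical.

```python
def normalise(description):
    """Collapse whitespace and case so trivial differences are ignored."""
    return " ".join(description.lower().split())


def find_description_mismatches(tool_descriptions, reference_descriptions):
    """Return (id, tool description, reference description) for every ID
    present in both mappings whose descriptions differ."""
    mismatches = []
    for vog_id, tool_desc in sorted(tool_descriptions.items()):
        ref_desc = reference_descriptions.get(vog_id)
        if ref_desc is None:
            continue  # ID missing from the reference; cannot compare
        if normalise(tool_desc) != normalise(ref_desc):
            mismatches.append((vog_id, tool_desc, ref_desc))
    return mismatches


# Values taken from the rows above.
metacerberus_vog = {
    "VOG01301": "Zinc finger A20 and AN1 domain-containing stress-associated protein 9",
    "VOG03885": "virion structural protein",
}
vogdb_current = {
    "VOG01301": "REFSEQ hypothetical protein",
    "VOG03885": "REFSEQ hypothetical protein",
}

for vog_id, tool_desc, ref_desc in find_description_mismatches(
    metacerberus_vog, vogdb_current
):
    print(f"{vog_id}: '{tool_desc}' vs '{ref_desc}'")
```

A high mismatch rate across many IDs would suggest the two sources come from different VOGDB releases rather than a few isolated curation changes.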

Perhaps there's an issue with the metadata file for VOGDB in MetaCerberus?

Looking forward to hearing from you!

Cheers, Hisham

raw-lab commented 1 month ago

Good afternoon Hisham,

Thank you for your kind words and use of MetaCerberus.

VOGDB recently updated everything. It had not been updated since 2017. Then boom, updated!

We have been unable to find the version we used here, as it has been wiped or moved.

We will update VOGDB to the newest version shortly.

We take the best hit for the 'final annotation file,' but we provide all the individual outputs for you to decide which one was best.

We leave it up to you to decide what is best - based on your best thoughts on what database is providing the best readout for you.

From the example you gave us, it appears the protein has a DNA pol I domain, a helicase domain, and potentially a 3'-5' exonuclease domain. You can look at PFAM/TIGRfams/PGFams to check it. Of course, it is a virion non-structural protein; it's a replicase for DNA. ;-)

Many proteins have multiple domains, especially DNA replicases. So, we leave it up to you to figure out what multidomain proteins are called.

We will leave this open for now and let you know when we replace VOGDB.

many thanks, RAW Lab

mdhishamshaikh commented 1 month ago

Hey, thank you for your promptness! It's good to know that there is indeed an issue with VOGDB. I will look forward to the update and thank you for your help with the decision making :)

Cheers, Hisham

raw-lab commented 1 month ago

Good afternoon,

Thank you for using MetaCerberus and your kind words! Tell your friends. ;-)

VOG had a major update. See here: https://vogdb.org/

The older metadata from the VOG database was mapped against INPHRED in order to have fewer hypothetical proteins, as the last update prior to this was back in 2017.

We are currently using the metadata list and update 225 from VOG.

We have updated MetaCerberus 1.4 with VOGdb 225. If you already have 1.4, to get the new database you just run:

conda activate metacerberus 
metacerberus.py --update

This will update your databases only. Please upgrade to 1.4 if you haven't already. Lots of upgrades and faster processing times with HydraMPP.

many thanks, RAW Lab

Closing for now. Let us know if you need anything.