Open mschecht opened 12 months ago
Hi, I just spotted this function on the development version of anvi'o and am very excited to use it! I've been doing CAZyme annotations outside of anvi'o for a while now, but I think it would be great to incorporate them :)
I was wondering if there are any plans to incorporate e.g. CAZy-DB as a search database, or in general to follow dbCAN's recommendation for integrating different search tools (e.g. HMM and DIAMOND searches) to ensure more accurate annotations?
Thanks, Isa
Hi @iwilkie, thanks for the question!
anvi-run-cazymes only runs the HMMs from dbCAN2 across the amino acid sequences in a contigs-db with hmmscan or hmmsearch. However, you can use anvi-import-functions to import annotations from any homology detection strategy you want into a contigs-db. For example, you can export the amino acid sequences from your contigs-db, search them against CAZy-DB with DIAMOND, and finally import those annotations back into the contigs-db.
Please let me know if you have any more questions or if I can clarify more!
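To make the round trip above concrete, here is a minimal Python sketch of the middle step: turning DIAMOND tabular output into a table that anvi-import-functions can read. It assumes the amino acid sequences were exported with something like anvi-get-sequences-for-gene-calls so that each query name is an integer gene caller ID, that DIAMOND wrote its default 12-column tabular format (e-value in column 11), and that the functions table uses the columns gene_callers_id, source, accession, function, and e_value. The function name `diamond_to_functions`, the source label `CAZy_DIAMOND`, and the e-value cutoff are illustrative choices, not part of anvi'o.

```python
import csv


def diamond_to_functions(diamond_lines, source="CAZy_DIAMOND", max_evalue=1e-5):
    """Keep the lowest e-value hit per gene and return rows for a functions table.

    diamond_lines: iterable of tab-separated lines in DIAMOND's default
    tabular layout (qseqid, sseqid, ..., evalue at index 10, bitscore).
    """
    best = {}
    for row in csv.reader(diamond_lines, delimiter="\t"):
        if not row:
            continue
        gene_id, subject, evalue = row[0], row[1], float(row[10])
        if evalue > max_evalue:
            continue  # drop weak hits
        if gene_id not in best or evalue < best[gene_id][1]:
            best[gene_id] = (subject, evalue)

    # Header row followed by one row per gene, sorted by integer gene caller ID.
    rows = [("gene_callers_id", "source", "accession", "function", "e_value")]
    for gene_id in sorted(best, key=int):
        subject, evalue = best[gene_id]
        # Assumption: the subject ID doubles as accession and function text.
        rows.append((gene_id, source, subject, subject, str(evalue)))
    return rows


def write_functions_table(rows, path):
    """Write the rows as a tab-delimited file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

The resulting file could then be passed to anvi-import-functions. How the subject ID maps to a readable function name depends on the headers in the CAZy-DB FASTA you searched against, so that part would likely need adjusting.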
@mschecht Thanks for the quick and detailed reply! Yes, exporting the AA sequences, annotating them, and then bringing that back into anvi'o works well; that's what I have been doing. I just stumbled upon this issue and was curious as to how much of this function would be incorporated into anvi'o.
Thanks again! :)
@iwilkie I started working on this feature request here and got confused about where I can find dbCAN3 files, in particular, the dbCAN-sub HMMs.
Is this the dbCAN3 dbCAN_sub file? It's under the dir /dbCAN2/, but that's what's hyperlinked from the dbCAN3 downloads page.
@mschecht as far as I know, the HMM dbs for dbCAN and dbCAN-sub are different. Not all prots with a CAZy domain (i.e. matching a dbCAN HMM profile) match dbCAN-sub profiles. Some dbCAN-sub profiles are not necessarily made from a subset of a dbCAN family domain, e.g. the subfamily PL6_e6 profile includes sequences matching PL6 and CBM16.
Coming back to the files, the one you indicate is the dbCAN-sub HMM, this one is the current dbCAN HMM (barely one month old).
In addition, this file can be used with the dbCAN-sub output to map EC/substrates to some of the subfamilies (which would be interesting to add too 😉 )
Thanks for the input @xvazquezc!
Coming back to the files, the one you indicate is the dbCAN-sub HMM, this one is the current dbCAN HMM (barely one month old).
To clarify, is this file a dbCAN-sub HMM file? https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt
I thought this was the standard dbCAN and not the dbCAN-sub. anvi-setup-cazymes can download this no problem!
Could this be the dbCAN_sub file? https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm
In addition, this file can be used with the dbCAN-sub output to map EC/substrates to some of the subfamilies (which would be interesting to add too 😉 )
That would be super cool! I think all this would take is a simple join with the dbCAN annotations in the contigs-db. What would be the best output file for you to leverage this data?
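As a rough illustration of the "simple join" mentioned above, here is a small Python sketch that attaches a substrate to each dbCAN-sub annotation by subfamily name. The real layout of the dbCAN-sub substrate mapping file is not described in this thread, so the mapping is modelled as a plain dict; the function name `join_substrates` and all values in the usage example are hypothetical placeholders.

```python
def join_substrates(annotations, substrate_map):
    """Left-join annotations with a subfamily -> substrate mapping.

    annotations: iterable of (gene_callers_id, subfamily) pairs.
    substrate_map: dict from subfamily name to substrate (assumed layout).
    Every annotation is kept; the substrate is None when no mapping exists.
    """
    joined = []
    for gene_id, subfamily in annotations:
        joined.append((gene_id, subfamily, substrate_map.get(subfamily)))
    return joined


# Usage with made-up placeholder values (not real dbCAN-sub assignments).
example_annotations = [(1, "PL6_e6"), (2, "GH5_e1")]
example_map = {"PL6_e6": "substrate_A"}
for record in join_substrates(example_annotations, example_map):
    print(record)
```

A left join is used here so that subfamilies without a known substrate still show up in the output rather than silently disappearing.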
I know it's confusing, they use a URL with dbCAN2 in it, but that's where the dbCAN3 server is located... the old dbCAN2 is at dbCAN2-obsolete. The new dbCAN3 is basically the same base dbCAN2 plus dbCAN-sub and substrate prediction, both through dbCAN-sub and dbCAN-PUL*.
About the files, this is the current dbCAN HMM: https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt, and this is the dbCAN-sub HMM: https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm
As for the substrate prediction, I think it needs to be matched based on the CAZy family and the predicted EC by dbCAN-sub (but tbh I'm not sure about the exact way the dbCAN server does it). Best would be to check the run_dbCAN repo: https://github.com/linnabrown/run_dbcan
dbCAN-sub and the substrate predictions might fit better with the metabolic prediction infrastructure... I never got to deal with anvi-estimate-metabolism and related stuff, so I'm not so confident about suggesting the best place this may go.
*dbCAN-PUL relies on the CGCfinder (code here) and annotates experimentally validated Polysaccharide Utilization Loci (PUL) by searching for transcription factors and transporters in the genes surrounding CAZy-annotated ones. It seems to be the preferred method for substrate matching, but it is also a much more complex operation.
@xvazquezc thank you very much for breaking this down for the anvi'o community!
About the files, this is the current dbCAN HMM: https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt, and this is the dbCAN-sub HMM: https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm
With this in mind, I will add the dbCAN-sub HMM file to anvi-run-cazymes so that users can access CAZyme HMMs.
dbCAN-sub and the substrate predictions might be better with metabolic prediction infrastructure... never got to deal with anvi-estimate-metabolism and related stuff so I'm not so confident about suggesting the best place this may go
That is a great point! @ivagljiva what are your thoughts?
Hey y'all :)
If I am understanding correctly, dbCAN-sub is another set of HMMs that provides more specific gene annotations?
If so, then it doesn't directly have a place in metabolism prediction and should still be used via anvi-run-cazymes to include annotations for these HMMs within the gene functions table.
However, users can then define their own metabolic pathways using the dbCAN-sub as a possible annotation source for the enzymes in the pathway :)
If I am understanding correctly, dbCAN-sub is another set of HMMs that provides more specific gene annotations? If so, then it doesn't directly have a place in metabolism prediction and should still be used via anvi-run-cazymes to include annotations for these HMMs within the gene functions table.
Sounds good, thanks for the input!
However, users can then define their own metabolic pathways using the dbCAN-sub as a possible annotation source for the enzymes in the pathway :)
I like that a lot! This could fill the niche where users are studying PULs that are not currently available via the CAZyme framework.
@meren to finish this feature request, I need to incorporate two HMM files into anvi-run-cazymes: dbCAN-HMMdb-V12.txt and dbCAN_sub.hmm.
I am currently working on this branch upgrade-to-dbCAN3.
dbCAN-HMMdb-V12.txt is already integrated but I am not sure how to smoothly add in the extra set of HMMs from dbCAN_sub.hmm. My two thoughts are (1) I can concatenate dbCAN_sub.hmm to dbCAN-HMMdb-V12.txt or (2) run HMMER separately on both HMM datasets. Which direction makes the most sense?
dbCAN treats them as 2 separate HMM libraries so I'd say to do the same and if a prot has matches with both, it's used as stronger evidence for that prot to be an actual CAZyme.
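The idea of running the two HMM libraries separately and treating agreement as stronger evidence could be sketched in Python as below. This is only an illustration of the bookkeeping, not how anvi-run-cazymes or dbCAN does it: the inputs are assumed to be dicts from gene caller ID to the best-hit profile name for each library, and the function name `combine_evidence` and the support labels are made up for the example.

```python
def combine_evidence(dbcan_hits, dbcan_sub_hits):
    """Merge per-gene hits from two separately searched HMM libraries.

    dbcan_hits, dbcan_sub_hits: dicts mapping gene_callers_id -> profile name.
    Returns a dict mapping each gene to (dbCAN hit, dbCAN-sub hit, support),
    where support is "both", "dbCAN only", or "dbCAN-sub only".
    """
    combined = {}
    for gene_id in sorted(set(dbcan_hits) | set(dbcan_sub_hits)):
        family = dbcan_hits.get(gene_id)
        subfamily = dbcan_sub_hits.get(gene_id)
        if family is not None and subfamily is not None:
            support = "both"  # matched in both libraries: stronger evidence
        elif family is not None:
            support = "dbCAN only"
        else:
            support = "dbCAN-sub only"
        combined[gene_id] = (family, subfamily, support)
    return combined


# Usage with placeholder gene IDs.
hits = combine_evidence({1: "PL6", 2: "GH5"}, {1: "PL6_e6", 3: "CBM16_e1"})
for gene_id, record in hits.items():
    print(gene_id, record)
```

Keeping the two result sets apart until this final step means each annotation source can still be stored under its own name, which concatenating the HMM files would make harder.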
Thanks for pointing this out @xvazquezc! That definitely answers my question :)
We can address #2148 in this branch :)
@mschecht sorry about not getting back to you on this! Somehow the notifications/emails didn't come through and I only just noticed.
I see that @xvazquezc was able to answer however, thank you :-)
The need
dbCAN3 contains dbCAN_sub which enables substrate annotation for CAZymes. This upgrade will help catalyze CAZyme-related studies in anvi'o.
The solution
Update anvi-setup-cazymes to download dbCAN3.
Beneficiaries
Biochemists and microbial ecologists interested in carbohydrate heterotrophy!