merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[FEATURE REQUEST] update dbCAN2 → dbCAN3 for anvi-setup-cazymes #2099

Open mschecht opened 12 months ago

mschecht commented 12 months ago

The need

dbCAN3 contains dbCAN_sub which enables substrate annotation for CAZymes. This upgrade will help catalyze CAZyme-related studies in anvi'o.

The solution

Update anvi-setup-cazymes to download dbCAN3.

Beneficiaries

Biochemists and microbial ecologists interested in carbohydrate heterotrophy!

iwilkie commented 11 months ago

Hi, I just spotted this function on the development version of anvi'o and am very excited to use it! I've been doing CAZyme annotations outside of anvi'o for a while now, but I think it would be great to incorporate them :)

I was wondering if there are any plans to incorporate e.g. CAZy-DB as a search database, or in general to follow dbCAN's recommendation of integrating different search tools (e.g. HMM and DIAMOND searches) to ensure more accurate annotations?

Thanks, Isa

mschecht commented 11 months ago

Hi @iwilkie, thanks for the question!

anvi-run-cazymes only runs the HMMs from dbCAN2 across the amino acid sequences in a contigs-db with hmmscan or hmmsearch. However, you can use anvi-import-functions to import annotations from any homology detection strategy you want into a contigs-db. For example, you can export the amino acid sequences from your contigs-db, search them against CAZy-DB with DIAMOND, and finally import those annotations back into the contigs-db.
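As a rough illustration of the last step, here is a minimal Python sketch that turns DIAMOND tabular output into a table for anvi-import-functions. It assumes DIAMOND was run with its default 12-column `--outfmt 6` layout, that the query IDs are the gene caller IDs from the exported amino acid sequences, and that the import table uses the columns `gene_callers_id`, `source`, `accession`, `function`, `e_value`. The source name `CAZy_DIAMOND` and the helper names are made up for this example, so please check the anvi-import-functions documentation before relying on it.

```python
import csv
import io

# Default column order of DIAMOND's tabular output (--outfmt 6).
DIAMOND_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def diamond_to_functions_rows(diamond_lines, source="CAZy_DIAMOND"):
    """Keep the best hit (lowest e-value) per gene and format it for import."""
    best = {}
    reader = csv.DictReader(diamond_lines, fieldnames=DIAMOND_COLUMNS, delimiter="\t")
    for hit in reader:
        gene_id = hit["qseqid"]
        evalue = float(hit["evalue"])
        if gene_id not in best or evalue < best[gene_id][1]:
            best[gene_id] = (hit["sseqid"], evalue)

    # Header row, then one row per gene with its best hit.
    rows = [["gene_callers_id", "source", "accession", "function", "e_value"]]
    for gene_id, (subject, evalue) in sorted(best.items()):
        # ASSUMPTION: the subject ID doubles as both accession and function label.
        rows.append([gene_id, source, subject, subject, f"{evalue:g}"])
    return rows


def write_functions_table(rows, handle):
    """Write the rows as a tab-delimited table to an open file handle."""
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
```

Typical use would be to open the DIAMOND output file, pass the file handle to `diamond_to_functions_rows`, and write the result to a new text file that is then given to anvi-import-functions.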

Please let me know if you have any more questions or if I can clarify more!

iwilkie commented 10 months ago

@mschecht Thanks for the quick and detailed reply! Yes, exporting the AA sequences, annotating them, and then bringing the annotations back into anvi'o works well; that's what I have been doing. I just stumbled upon this issue and was curious as to how much of this function would be incorporated into anvi'o.

Thanks again! :)

mschecht commented 10 months ago

@iwilkie I started working on this feature request here and got confused about where I can find dbCAN3 files, in particular, the dbCAN-sub HMMs.

Is this the dbCAN3 dbCAN_sub file? It's under the dir /dbCAN2/, but that's what's hyperlinked from the dbCAN3 downloads page.

xvazquezc commented 10 months ago

@mschecht as far as I know, the HMM databases for dbCAN and dbCAN-sub are different. Not all proteins with a CAZy domain (i.e. matching a dbCAN HMM profile) match dbCAN-sub profiles. Some dbCAN-sub profiles are not necessarily made from a subset of a dbCAN family domain, e.g. the subfamily PL6_e6 profile includes sequences matching PL6 and CBM16.

Coming back to the files, the one you indicate is the dbCAN-sub HMM, this one is the current dbCAN HMM (barely one month old).

In addition, this file can be used with the dbCAN-sub output to map EC/substrates to some of the subfamilies (which would be interesting to add too :wink: )

mschecht commented 10 months ago

Thanks for the input @xvazquezc!

> Coming back to the files, the one you indicate is the dbCAN-sub HMM, this one is the current dbCAN HMM (barely one month old).

To clarify, is this file a dbCAN-sub HMM file? https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt I thought this was the standard dbCAN and not the dbCAN-sub. anvi-setup-cazymes can download this no problem!

Could this be the dbCAN_sub file? https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm

> In addition, this file can be used with the dbCAN-sub output to map EC/substrates to some of the subfamilies (which would be interesting to add too 😉 )

That would be super cool! I think all this would take is a simple join with the dbCAN annotations in the contigs-db. What would be the best output file for you to leverage this data?
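To make the "simple join" idea concrete, here is a minimal Python sketch of a left join between dbCAN-sub hits and a subfamily-to-substrate table. The layout of the real mapping file is not described in this thread, so the dictionary structure, the field names `ec` and `substrate`, and the placeholder values are all assumptions for illustration only.

```python
def join_substrates(subfamily_hits, substrate_map):
    """Attach EC/substrate info to each gene's dbCAN-sub hit (left join).

    subfamily_hits: iterable of (gene_callers_id, subfamily) pairs.
    substrate_map:  dict of subfamily -> {"ec": ..., "substrate": ...}
                    (HYPOTHETICAL layout, not the real mapping file format).
    """
    joined = []
    for gene_id, subfamily in subfamily_hits:
        info = substrate_map.get(subfamily)
        joined.append({
            "gene_callers_id": gene_id,
            "subfamily": subfamily,
            # Genes whose subfamily is missing from the map keep None values,
            # so no annotation is silently dropped.
            "ec": info["ec"] if info else None,
            "substrate": info["substrate"] if info else None,
        })
    return joined


# Placeholder values only; they are not real EC numbers or substrates.
example_map = {"PL6_e6": {"ec": "EC_PLACEHOLDER", "substrate": "SUBSTRATE_PLACEHOLDER"}}
example_hits = [(1, "PL6_e6"), (2, "UNMAPPED_SUBFAMILY")]
example_joined = join_substrates(example_hits, example_map)
```

Keeping unmapped subfamilies in the output (rather than doing an inner join) reflects the point made above that the mapping only covers some of the subfamilies.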

xvazquezc commented 10 months ago

I know it's confusing: they use a URL with dbCAN2 in it, but that's where the dbCAN3 server is located... the old dbCAN2 is at dbCAN2-obsolete. The new dbCAN3 is basically the same base dbCAN2 plus dbCAN-sub and substrate prediction, both through dbCAN-sub and dbCAN-PUL*.

About the files, this is the current dbCAN HMM: https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt, and this is the dbCAN-sub HMM: https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm
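For anyone who wants to fetch both files by hand, here is a minimal Python sketch using only the standard library. The two URLs are the ones given above; the function name, the local file names, and the skip-if-present behaviour are choices made for this example and are not how anvi-setup-cazymes is implemented.

```python
import os
import urllib.request

# URLs as posted in this thread; local file names mirror the remote ones.
DBCAN_FILES = {
    "dbCAN-HMMdb-V12.txt": "https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt",
    "dbCAN_sub.hmm": "https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm",
}


def download_dbcan_hmms(dest_dir, fetch=urllib.request.urlretrieve):
    """Download both HMM libraries into dest_dir and return their local paths.

    `fetch` is injectable so the function can be exercised without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paths = {}
    for filename, url in DBCAN_FILES.items():
        path = os.path.join(dest_dir, filename)
        if not os.path.exists(path):  # skip files that are already present
            fetch(url, path)
        paths[filename] = path
    return paths
```

Calling `download_dbcan_hmms("cazyme_data")` would place both files in a `cazyme_data` directory (a directory name picked for this example).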

As for the substrate prediction, I think it needs to be matched based on the CAZy family and the predicted EC by dbCAN-sub (but tbh I'm not sure about the exact way the dbCAN server does it). Best would be to check the run_dbCAN repo: https://github.com/linnabrown/run_dbcan

dbCAN-sub and the substrate predictions might be better with metabolic prediction infrastructure... I never got to deal with anvi-estimate-metabolism and related stuff, so I'm not so confident about suggesting the best place this may go.

*dbCAN-PUL relies on CGCfinder (code here) and annotates experimentally validated Polysaccharide Utilization Loci (PUL) by searching for transcription factors and transporters in the genes surrounding CAZy-annotated ones. It seems to be the preferred method for substrate matching, but it is also a much more complex operation.
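The neighborhood idea can be illustrated with a deliberately simplified Python sketch. This is not CGCfinder's actual algorithm: the labels `"CAZyme"`, `"TF"` and `"TC"`, the fixed gene-count window, and the function name are all assumptions made for this toy example.

```python
# Labels for the "signature" genes searched for around each CAZyme.
# ASSUMPTION: "TF" = transcription factor, "TC" = transporter.
SIGNATURE_LABELS = {"TF", "TC"}


def find_candidate_loci(genes, window=2):
    """Report CAZymes that have a TF or transporter within `window` genes.

    genes: list of (gene_id, label) tuples in genomic order on one contig,
           where label is "CAZyme", "TF", "TC" or None.
    """
    loci = []
    for i, (gene_id, label) in enumerate(genes):
        if label != "CAZyme":
            continue
        # Collect up to `window` genes on each side of the CAZyme.
        start = max(0, i - window)
        neighbors = genes[start:i] + genes[i + 1:i + 1 + window]
        partners = [g for g, l in neighbors if l in SIGNATURE_LABELS]
        if partners:
            loci.append((gene_id, partners))
    return loci


example_genes = [
    ("g1", "TF"), ("g2", None), ("g3", "CAZyme"), ("g4", "TC"),
    ("g5", None), ("g6", None), ("g7", None), ("g8", "CAZyme"),
]
example_loci = find_candidate_loci(example_genes, window=2)
```

In this toy input only `g3` is reported, because `g8` has no transcription factor or transporter within two genes of it.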

mschecht commented 10 months ago

@xvazquezc thank you very much for breaking this down for the anvi'o community!

> About the files, this is the current dbCAN HMM: https://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V12.txt, and this is the dbCAN-sub HMM: https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm

With this in mind, I will add the dbCAN-sub HMM file to anvi-run-cazymes so that users can access CAZyme HMMs.

> dbCAN-sub and the substrate predictions might be better with metabolic prediction infrastructure... I never got to deal with anvi-estimate-metabolism and related stuff, so I'm not so confident about suggesting the best place this may go.

That is a great point! @ivagljiva what are your thoughts?

ivagljiva commented 10 months ago

> > dbCAN-sub and the substrate predictions might be better with metabolic prediction infrastructure... I never got to deal with anvi-estimate-metabolism and related stuff, so I'm not so confident about suggesting the best place this may go.
>
> That is a great point! @ivagljiva what are your thoughts?

Hey y'all :) If I am understanding correctly, dbCAN-sub is another set of HMMs that provides more specific gene annotations? If so, then it doesn't directly have a place in metabolism prediction and should still be used via anvi-run-cazymes to include annotations for these HMMs within the gene functions table.

However, users can then define their own metabolic pathways using the dbCAN-sub as a possible annotation source for the enzymes in the pathway :)

mschecht commented 10 months ago

> If I am understanding correctly, dbCAN-sub is another set of HMMs that provides more specific gene annotations? If so, then it doesn't directly have a place in metabolism prediction and should still be used via anvi-run-cazymes to include annotations for these HMMs within the gene functions table.

Sounds good, thanks for the input!

> However, users can then define their own metabolic pathways using the dbCAN-sub as a possible annotation source for the enzymes in the pathway :)

I like that a lot! This could fill the niche where users are studying PULs that are not currently available via the CAZyme framework.

mschecht commented 10 months ago

@meren to finish this feature request, I need to incorporate two HMM files into anvi-run-cazymes: dbCAN-HMMdb-V12.txt and dbCAN_sub.hmm.

I am currently working on this branch upgrade-to-dbCAN3.

dbCAN-HMMdb-V12.txt is already integrated but I am not sure how to smoothly add in the extra set of HMMs from dbCAN_sub.hmm. My two thoughts are (1) I can concatenate dbCAN_sub.hmm to dbCAN-HMMdb-V12.txt or (2) run HMMER separately on both HMM datasets. Which direction makes the most sense?

xvazquezc commented 10 months ago

dbCAN treats them as 2 separate HMM libraries so I'd say to do the same and if a prot has matches with both, it's used as stronger evidence for that prot to be an actual CAZyme.
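A minimal Python sketch of that idea: keep the results of the two searches separate and then flag genes that were hit by both libraries. The input dictionaries (gene caller ID to best-matching profile), the `support` labels, and the function name are assumptions for illustration; how anvi-run-cazymes ends up storing this is not settled in this thread.

```python
def combine_evidence(dbcan_hits, dbcan_sub_hits):
    """Merge per-gene hits from two independent HMM searches.

    Both arguments map gene_callers_id -> name of the best-matching profile.
    Genes hit by both libraries are labelled "both" (stronger evidence);
    genes hit by only one are labelled "single".
    """
    summary = {}
    for gene_id in sorted(set(dbcan_hits) | set(dbcan_sub_hits)):
        in_family = gene_id in dbcan_hits
        in_sub = gene_id in dbcan_sub_hits
        summary[gene_id] = {
            "dbCAN": dbcan_hits.get(gene_id),
            "dbCAN_sub": dbcan_sub_hits.get(gene_id),
            "support": "both" if in_family and in_sub else "single",
        }
    return summary


# Toy input: gene 1 is hit by both libraries, genes 2 and 3 by only one.
example_summary = combine_evidence(
    {1: "PL6", 2: "FAMILY_PLACEHOLDER"},
    {1: "PL6_e6", 3: "SUBFAMILY_PLACEHOLDER"},
)
```

Keeping the two sources as separate columns means either annotation can still be used on its own later.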

mschecht commented 10 months ago

Thanks for pointing this out @xvazquezc! That definitely answers my question :)

mschecht commented 9 months ago

We can address #2148 in this branch :)

iwilkie commented 9 months ago

@mschecht sorry about not getting back to you on this! Somehow the notifications/emails didn't come through and I only just noticed.

I see that @xvazquezc was able to answer however, thank you :-)