meaning of repeated substrates

linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.

http://bcb.unl.edu/dbCAN2

GNU General Public License v3.0

meaning of repeated substrates #173

Open CamiAgustini opened 2 months ago

CamiAgustini commented 2 months ago

Hi all,

I am analyzing the table that dbCAN gives me, and I find entries in which the substrate is repeated. For example, some entries list the substrate as "chitin" while others say "chitin, chitin". What does this mean?

Another thing I don't understand is the meaning of the number that follows the family name. For example, for a GH18 I get "GH18_e428"; what does the number after the "e" correspond to?

Xinpeng021001 commented 2 months ago

Hi. For problem 1: we use eCAMI (k-mer based) to create subfamilies for each CAZy family, and then use CAZymes with EC numbers as labels to annotate, by manual curation, the substrate for each subfamily. We then build an HMM for each subfamily, with or without a substrate (some subfamilies could be assigned substrates and some could not); these make up the dbCAN-sub HMMs. A CAZyme can sometimes be assigned to multiple subfamilies with different or identical substrates (see dbCAN-sub.out). That is why you see multiple substrates, including repeated ones such as "chitin, chitin".
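If the repeats get in the way of downstream analysis, they can be collapsed after the fact. The following is only a minimal sketch and not part of run_dbcan: it assumes the substrate cell is a plain comma-separated string, as in the "chitin, chitin" example, and the function name `unique_substrates` is hypothetical.

```python
from __future__ import annotations


def unique_substrates(cell: str) -> list[str]:
    """Collapse a comma-separated substrate cell into unique names.

    Assumption: substrates are separated by commas, as in "chitin, chitin".
    The first-seen order is preserved.
    """
    seen: list[str] = []
    for name in cell.split(","):
        name = name.strip()
        # Skip empty fragments and names already recorded.
        if name and name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    print(unique_substrates("chitin, chitin"))
    print(unique_substrates("xylan, chitin, xylan"))
```

Note that collapsing the repeats discards the information that more than one subfamily hit supported the same substrate, so it may be worth keeping the original column alongside the deduplicated one.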

For problem 2: the "_eXX" suffix denotes the subfamily. For example, GH18_e428 is subfamily e428 of family GH18.
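To work with the family and the subfamily separately, the label can be split on the "_e" suffix. This is a small sketch under the assumption that labels look like the GH18_e428 example (letters, a family number, then "_e" and a subfamily number); the function name `split_subfamily` and the regular expression are illustrative and not taken from run_dbcan.

```python
import re

# Assumed label shape, based on the single example "GH18_e428":
# family letters + family number, then "_e" + subfamily number.
SUBFAMILY_RE = re.compile(r"^(?P<family>[A-Za-z]+\d+)_e(?P<index>\d+)$")


def split_subfamily(label):
    """Return (family, subfamily_number) for a label such as "GH18_e428"."""
    match = SUBFAMILY_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected subfamily label: {label!r}")
    return match.group("family"), int(match.group("index"))


if __name__ == "__main__":
    print(split_subfamily("GH18_e428"))
```

Labels that do not match the assumed shape raise a `ValueError` rather than being silently mis-split, so unexpected formats show up early.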

Please review our dbCAN-seq update (https://doi.org/10.1093/nar/gkac1068) and dbCAN3 paper (https://doi.org/10.1093/nar/gkad328) if needed.

Hope this could help you. Please let us know if you have any other questions.

cmkobel commented 2 months ago

I have another usage question pertaining to the substrates. Why are some of them missing? I would like to know the substrates of all of the cazymes. I understand that this is a matter of manual curation, but is there a place where I can find the missing ones?

Xinpeng021001 commented 2 months ago

Hi. In the CAZy database there are two types of CAZymes: those with an EC number and those without (the majority). We use the EC number as a label to assign a substrate to a subfamily, because CAZymes with EC numbers can be curated with known substrates from the literature or from databases such as BRENDA. However, as mentioned, a great many CAZymes have no EC number, and some subfamilies consist only of such CAZymes. That is why you cannot find substrates for those subfamilies.

If you want to find those substrates, I would suggest reviewing the literature or using the supplementary materials of the dbCAN3 paper to find substrates at the CAZyme family level rather than the subfamily level.

yinlabniu commented 2 months ago

From https://doi.org/10.1093/nar/gkad328: "After the subfamily classification, 3003 CAZyme subfamilies contain experimentally characterized CAZy proteins with EC numbers, and among them only 655 (21.8%) subfamilies have more than one EC numbers (Figure 1B). 23 038 CAZyme subfamilies contain no experimentally characterized CAZy proteins and no EC numbers. Their HMMs will not help substrate prediction but can still be informative with subfamily annotation".
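To put the quoted counts in proportion, the arithmetic can be checked in a few lines. This is only a sketch: the three counts come from the quotation, while the total assumes that the two groups (subfamilies with and without EC-characterized members) are disjoint and together cover all subfamilies, which the quotation suggests but does not state outright.

```python
# Counts taken from the quoted dbCAN3 passage.
with_ec = 3003       # subfamilies containing characterized proteins with EC numbers
multi_ec = 655       # of those, subfamilies with more than one EC number
without_ec = 23038   # subfamilies with no characterized proteins and no EC numbers

# Share of EC-labelled subfamilies that carry more than one EC number.
share_multi_ec = multi_ec / with_ec

# Assumption: the two groups are disjoint and exhaustive.
total = with_ec + without_ec
share_with_ec = with_ec / total

print(f"multi-EC share among EC-labelled subfamilies: {share_multi_ec:.1%}")
print(f"assumed total number of subfamilies: {total}")
print(f"share of subfamilies with any EC label: {share_with_ec:.1%}")
```

Under that assumption, only a small minority of subfamilies can receive a substrate through EC labels, which is consistent with many CAZymes having no substrate in the output.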