Different substrate annotations in dbCAN-PUL and run_dbcan4

linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.

http://bcb.unl.edu/dbCAN2

GNU General Public License v3.0


Different substrate annotations in dbCAN-PUL and run_dbcan4 #132

Open Russel88 opened 1 year ago

Russel88 commented 1 year ago

Hi developers,

The substrate annotations in dbCAN-PUL seem to differ from those in the output of run_dbcan. I'm not sure whether this is intended or an error. For example, PUL0291 has lactose as its substrate in dbCAN-PUL, which also matches what is stated in the associated reference paper. However, in the file "dbCAN-PUL_07-01-2022.txt" used by run_dbcan4, the substrate for PUL0291 is "human milk oligosaccharide".

I also found this file "dbCAN-PUL.substrate.mapping.xls" on the FTP server, which has both a "curated substrate" and two "updated substrate" columns, but only the curated one seems to be correct. Am I misunderstanding something?

yinlabniu commented 11 months ago

Thanks for the question. Substrate curation is not an easy job, as glycans appear under different names and are defined at different levels in the literature (e.g., lactoses are human milk oligosaccharides, and HMOs are host glycans), and we are not happy with our curation either. dbCAN-PUL.substrate.mapping.xls is the most complete reference table and is actively maintained, but dbCAN-PUL_07-01-2022.txt is what run_dbcan uses. The updated substrate columns in dbCAN-PUL.substrate.mapping.xls reflect our continuing efforts to group substrates. But you are correct that the "curated_substrate (07/01/2022)" column was extracted directly from the literature; it was then used to derive the "updated_substrate" columns (at higher levels of glycan definition).

In short, there is no easy solution for now, and we are continuously working on a hierarchical classification to define glycans.

Yanbin

Weilan2011 commented 7 months ago

I have similar questions. I tested the annotation on the server with several "well-documented genomes", and it gave reasonable substrate predictions. So I ran a batch annotation of 500 genomes by following the "Run from Raw Reads: Automated CAZyme and Glycan Substrate Annotation in Microbiomes: A Step-by-Step Protocol". However, the outputs from "run_dbcan" for the same genomes were quite different from what I got from the server, which left me very confused. Is it necessary to manually curate the output from "run_dbcan" with dbCAN-PUL.substrate.mapping.xls?
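
For anyone who wants to check how far the two tables disagree before deciding on manual curation, here is a minimal sketch of one way to flag the differing PULs. It is not part of run_dbcan. It assumes both tables have been exported to tab-delimited text, and the column names (`pul_id`, `substrate`, `curated_substrate`) are hypothetical placeholders that would need to be changed to match the real headers. The PUL0291 row uses the values reported above; the `PUL_TOY` row is made up for illustration.

```python
import csv
import io


def load_substrates(handle, id_col, substrate_col):
    """Read a tab-delimited table and return {PUL ID: substrate label}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col].strip(): row[substrate_col].strip() for row in reader}


def find_mismatches(used, curated):
    """Return (PUL ID, label used by the tool, curated label) where they differ.

    Only PUL IDs present in both tables are compared; the comparison
    ignores letter case.
    """
    mismatches = []
    for pul_id in sorted(used.keys() & curated.keys()):
        if used[pul_id].lower() != curated[pul_id].lower():
            mismatches.append((pul_id, used[pul_id], curated[pul_id]))
    return mismatches


# Toy stand-ins for the two files (column names are assumptions).
used_txt = (
    "pul_id\tsubstrate\n"
    "PUL0291\thuman milk oligosaccharide\n"
    "PUL_TOY\txylan\n"
)
curated_txt = (
    "pul_id\tcurated_substrate\n"
    "PUL0291\tlactose\n"
    "PUL_TOY\txylan\n"
)

used = load_substrates(io.StringIO(used_txt), "pul_id", "substrate")
curated = load_substrates(io.StringIO(curated_txt), "pul_id", "curated_substrate")
mismatches = find_mismatches(used, curated)

for pul_id, used_label, curated_label in mismatches:
    print(f"{pul_id}: run_dbcan table says '{used_label}', curated says '{curated_label}'")
```

With real files, the `io.StringIO(...)` arguments would be replaced by `open(path, newline="")` handles. Note that a mismatch is not necessarily an error: as explained above, the updated labels are deliberately defined at a higher level than the curated ones.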