linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

finding %completeness of CGC in the microbial genome of interest <Theory question and suggestions> #161

Open Jigyasa3 opened 5 months ago

Jigyasa3 commented 5 months ago

Dear @yinlabniu,

Thank you again for a very important tool to annotate CAZymes and identify CGCs in microbial genomes. I am interested in examining how complete the CGCs in my microbial genome of interest are. For example, suppose dbCAN3 identifies 5 CGCs in the genome. To estimate the %completeness of these CGCs, I extract the nucleotide sequences spanning the start and end coordinates of each CGC, and likewise of each PUL in the dbCAN-PUL database. I then run a BLASTn search of the 5 CGC sequences against the complete dbCAN-PUL database to get %similarity and %coverage.

Is that a correct approach? My goal is to be able to say, bioinformatically, that we found 5 CGCs in the microbial genome which are XYZ% similar to known PULs and ABC% complete, so we can speculate that these CGCs are functional. If the similarity and coverage are below ~40% (an arbitrary cutoff), the CGC would be either novel or non-functional.
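A minimal sketch of the per-pair coverage and identity calculation described above, assuming the BLASTn hits were saved in the default 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and the `query_lengths` dictionary (CGC id to CGC length in bp) are hypothetical and would have to be supplied by the user. Note that BLASTn reports percent identity, which is what is used here in place of "%similarity".

```python
from collections import defaultdict


def merged_length(intervals):
    """Number of positions covered by a set of 1-based inclusive intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous run (if any) and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent HSP: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def summarize_hits(blast_lines, query_lengths):
    """Summarize HSPs per (CGC, PUL) pair.

    Returns {(qseqid, sseqid): {"identity": ..., "coverage": ...}}, where
    identity is the alignment-length-weighted mean percent identity and
    coverage is the percent of the CGC covered by at least one HSP.
    """
    hsps = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        hsps[(qseqid, sseqid)].append((qstart, qend, pident, length))

    summary = {}
    for (qseqid, sseqid), rows in hsps.items():
        covered = merged_length([(start, end) for start, end, _, _ in rows])
        aligned = sum(length for _, _, _, length in rows)
        identity = sum(pident * length for _, _, pident, length in rows) / aligned
        summary[(qseqid, sseqid)] = {
            "identity": identity,
            "coverage": 100.0 * covered / query_lengths[qseqid],
        }
    return summary
```

Merging overlapping HSPs before dividing by the CGC length keeps a region hit twice from being counted twice. The ~40% cutoff can then be applied to both values of each pair.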

Looking forward to your suggestions and reply! Regards, Jigyasa

yinlabniu commented 5 months ago

The short answer is yes. We used a similar strategy in dbCAN3 when predicting substrates for CGCs by BLAST search against dbCAN-PULs, although our parsing thresholds are more relaxed (min identity 20% and min 2 CAZyme matches to call a CGC-PUL pair). However, I should mention that the boundaries of CGCs (which affect CGC length) have never been rigorously evaluated. PUL boundaries are often experimentally determined (e.g., through RNA-seq differential expression), but CGC boundaries are determined arbitrarily by our CGC prediction criteria (default: at least one CAZyme and one transporter, with fewer than 2 inserted non-signature genes; this can be customized by users). Therefore, in many cases, the %coverage or completeness cutoff you mentioned is difficult to determine.
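As an illustration of the two thresholds mentioned above (min identity 20%, min 2 CAZyme matches), here is a minimal sketch of how a CGC-PUL pair could be called from gene-level hits. This is not the dbCAN3 code: the function name, the input tuple layout, and the choice to count distinct CGC CAZyme genes are all assumptions made for the example.

```python
from collections import defaultdict

MIN_IDENTITY = 20.0       # percent identity threshold quoted above
MIN_CAZYME_MATCHES = 2    # minimum number of CAZyme matches quoted above


def call_cgc_pul_pairs(hits, min_identity=MIN_IDENTITY,
                       min_cazyme_matches=MIN_CAZYME_MATCHES):
    """Return {(cgc_id, pul_id): n_matched_cazymes} for pairs passing both cutoffs.

    hits: iterable of (cgc_id, cgc_gene, pul_id, percent_identity, is_cazyme)
    tuples (hypothetical layout), one per gene-level hit of a CGC gene
    against a gene of a PUL.
    """
    matched = defaultdict(set)
    for cgc_id, cgc_gene, pul_id, identity, is_cazyme in hits:
        # Only CAZyme genes with a sufficiently similar hit count as a match.
        if is_cazyme and identity >= min_identity:
            matched[(cgc_id, pul_id)].add(cgc_gene)
    return {
        pair: len(genes)
        for pair, genes in matched.items()
        if len(genes) >= min_cazyme_matches
    }


example_hits = [
    ("CGC1", "gene1", "PUL_A", 35.0, True),
    ("CGC1", "gene2", "PUL_A", 22.5, True),
    ("CGC1", "gene3", "PUL_A", 90.0, False),  # not a CAZyme, ignored
    ("CGC1", "gene1", "PUL_B", 50.0, True),
    ("CGC1", "gene2", "PUL_B", 15.0, True),   # below the identity cutoff
]
pairs = call_cgc_pul_pairs(example_hits)
```

With the example hits, only the CGC1/PUL_A pair passes, because PUL_B has a single CAZyme match at or above 20% identity.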

linnabrown commented 2 months ago

Do you still have questions? @Jigyasa3 If not, please close the issue.