linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

dbsub.out matches multiple hits per Gene.ID. Do we keep all or the best hit? #149

Open Jigyasa3 opened 8 months ago

Jigyasa3 commented 8 months ago

Hi @linnabrown and @yinlabniu

I am examining the "dbsub.out" file, and about 500 Gene IDs have multiple dbCAN.subfam and Substrate entries. Do you recommend keeping all the hits or selecting only the best hit?

For example, in the screenshot below, I am interested in examining all the chitin-degrading Gene IDs, so I am worried that I might lose that information if I select only the best hit. Keeping all the hits would suggest that this protein can target cellulose, chitin, and xylan. Additionally, both GH and CBM annotations are important for determining whether the GH in a Gene ID has an accessory domain. Do you recommend keeping Gene IDs that have only a CBM annotation when selecting by substrate? (For example, since a CBM will only have an accessory role in chitin degradation, should I examine only GHs with a CBM for this substrate?)

I would appreciate suggestions on whether my understanding of the output is correct.

[Image: Screenshot 2024-01-08 at 8 02 31 PM]

linnabrown commented 8 months ago

@yinlabniu @QiweiGe Could you please answer this question?

yinlabniu commented 8 months ago

You should keep all of them. This file has already been parsed to account for the presence of multiple domains in the same query protein. In the case you showed, the protein has four domains and each one gives a substrate prediction, so you should keep all of them.
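A minimal sketch of what "keep all of them" could look like in practice: group every row under its Gene ID, then select the genes where any domain hit names the substrate of interest. The column names (`dbCAN subfam`, `Substrate`, `Gene ID`), the comma-separated substrate field, and the rows themselves are assumptions made up for illustration, so check them against the header of your own file.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only. Column names, gene IDs, subfamily
# names, and coordinates are assumptions, not real dbCAN-sub output.
SAMPLE = (
    "dbCAN subfam\tSubstrate\tGene ID\tGene Start\tGene End\n"
    "GH5_e1\tcellulose\tgene_1\t30\t310\n"
    "GH18_e2\tchitin\tgene_1\t350\t640\n"
    "CBM5_e3\tchitin\tgene_1\t660\t700\n"
    "GH10_e4\txylan\tgene_2\t25\t330\n"
)


def group_hits(handle, id_col="Gene ID"):
    """Group every domain hit (one row each) under its gene ID, dropping none."""
    hits = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        hits[row[id_col]].append(row)
    return dict(hits)


def genes_with_substrate(hits, substrate, col="Substrate"):
    """Return sorted gene IDs having at least one hit that names the substrate."""
    wanted = substrate.lower()
    found = set()
    for gene, rows in hits.items():
        for row in rows:
            # Assumption: several substrates may be listed, separated by commas.
            names = [s.strip().lower() for s in row[col].split(",")]
            if wanted in names:
                found.add(gene)
                break
    return sorted(found)


hits = group_hits(io.StringIO(SAMPLE))
chitin_genes = genes_with_substrate(hits, "chitin")
print(chitin_genes)
```

With a best-hit-only filter, `gene_1` in this toy table would keep a single row and could lose its chitin annotation; grouping keeps all three domain rows available for substrate selection.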

Note that in our new run_dbcan release, the file dbsub.out has been renamed to dbcan-sub.hmm.out. To give you another example, in the following file: https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Dry2014.dbCAN/dbcan-sub.hmm.out, there are 12947 rows but only 11827 proteins. That's because 894 proteins have multiple domains (each domain match has one row in the file). For instance, the protein Dry2014_81126 has three domains: GH43_e159, GH43_e22, GH43_e159, and their positions in the full-length protein (cols 11 and 12) are different.
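The row-versus-protein count described above can be checked with a short script. This is only a sketch: the column names (`dbCAN subfam`, `Gene ID`, `Gene Start`, `Gene End`) are assumed, and the gene IDs and coordinates below are invented; only the subfamily names are borrowed from the example in this thread.

```python
import csv
import io
from collections import Counter

# Invented rows for illustration only; the coordinates are not taken from
# the real Dry2014 file, and the column names are assumptions.
SAMPLE = (
    "dbCAN subfam\tGene ID\tGene Start\tGene End\n"
    "GH43_e22\tgene_A\t320\t610\n"
    "GH43_e159\tgene_A\t20\t300\n"
    "GH43_e159\tgene_A\t630\t900\n"
    "GH43_e22\tgene_B\t15\t290\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# One row per domain match, so rows >= proteins.
per_gene = Counter(row["Gene ID"] for row in rows)
n_rows = len(rows)
n_proteins = len(per_gene)
multi_domain = sorted(gene for gene, n in per_gene.items() if n > 1)


def domain_layout(rows, gene):
    """List (subfam, start, end) for one gene, ordered along the protein."""
    own = [row for row in rows if row["Gene ID"] == gene]
    own.sort(key=lambda row: int(row["Gene Start"]))
    return [
        (row["dbCAN subfam"], int(row["Gene Start"]), int(row["Gene End"]))
        for row in own
    ]


print(n_rows, n_proteins, multi_domain)
print(domain_layout(rows, "gene_A"))
```

To run this on a real file, replace `io.StringIO(SAMPLE)` with an opened file handle. Repeated subfamily names on the same gene are then distinguishable by their start and end positions rather than looking like duplicates.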