output explanations for "diamond.out", "hmmer.out","dtemp.out","overview.txt"

linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.

http://bcb.unl.edu/dbCAN2

GNU General Public License v3.0

output explanations for "diamond.out", "hmmer.out","dtemp.out","overview.txt" #138

Open Jigyasa3 opened 10 months ago

Jigyasa3 commented 10 months ago

Hi @linnabrown , @yinlabniu , @AaronAOliver

Thank you for a great resource! I am looking through the output of run_dbcan script and was wondering if you could guide me in the right direction. My outputs are different from #127

script used- run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate

output-

1. diamond.out: There appears to be a single hit per Gene.ID. Does that mean the file has already been reduced to the best hit by E-value, so that we can examine it directly without filtering the results? It looks similar to the blast.out file in #127.
2. hmmer.out: It lists an HMM profile per Gene.ID, but some Gene.IDs have multiple hits. What is the difference between hmmer.out and diamond.out? Is hmmer.out the final file for Gene.ID annotation, or does it need to be filtered?
3. dbsub.out is also similar to #127, but the command I ran (shown above) does not generate sub.prediction.out.
4. overview.txt is what is mentioned in the README.md file. Is this the final file to examine the CGCs and substrates?
5. In the overview.txt file there are 6 columns (namely EC., HMMER, dbCAN_sub, DIAMOND, Signalp, X.ofTools). How do I extract the best hit per Gene.ID? Can I say that if X.ofTools is more than 3, I can trust the annotation? Currently, I am filtering the overview.txt file to keep rows where EC. is not empty, and adding the substrate info from dbsub.out to the filtered overview.txt file. For example, one of the hits in overview.txt is GH1_e65. This hit maps to the beta-galactan substrate in the dbsub.out file. Is that the correct way to proceed? At the same time, I don't have the CGCs output. Why do you think that's happening?

Looking forward to your reply! Jigyasa

AaronAOliver commented 10 months ago

Hello Jigyasa,

I am not a developer, I only made a very small code contribution as a user, so I cannot answer your questions for sure. But here is my experience as someone who uses this program regularly and is familiar with the code:

1. The diamond.out file is generated by running diamond against a protein database with the parameter -k 1, which means that diamond returns only a single target CAZyme annotation per gene. It also uses a low E-value cutoff, -e 1e-102, to keep only good hits. So this file is not filtered further after diamond runs, and it already contains only the best hit per gene.
2. The hmmer.out file includes all valid hits based on HMMs. It seems like the overall best hit used for the final annotation is the HMM hit with the lowest E-value and highest coverage.
3. sub.prediction.out gives substrate predictions for CGCs. Based on your response to q5, the reason you don't get this file is that you didn't get any predicted CGCs.
4. overview.txt is a summary of individual protein annotations, while cgc.out summarizes annotated gene clusters. I personally tend to use the annotation associated with each gene in cgc.out as my "final" annotation, but I don't know if that is the recommended usage.
5. I would say that the count in X.ofTools is less important than making sure all of the annotations agree. Substrate prediction for whole clusters tends to be better than for individual enzymes. I think the larger issue is not getting any CGCs out of the software. I would try downloading a genome with CGCs from this group's dbCAN_seq database (https://bcb.unl.edu/dbCAN_seq/) and running it through your installation to make sure everything is working properly.
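
For what "making sure all of the annotations agree" could look like in practice, here is a minimal Python sketch. It rests on assumptions about overview.txt that this thread does not confirm: that "-" marks a tool with no hit, that "+" joins multiple hits in one cell, and that every hit starts with its family name (for example GH1(20-400), GH1_e65 or GH1). The function names are made up for illustration and are not part of run_dbcan.

```python
import re


def families(cell):
    """Return the set of CAZyme families named in one overview.txt cell.

    Assumed cell format (not confirmed by the thread): hits are joined by
    "+", "-" means no hit, and each hit starts with a family such as GH5.
    """
    found = set()
    for hit in cell.split("+"):
        # Keep only the leading family token, e.g. "GH5" from "GH5_e273"
        # or from "GH5_4(1-300)"; a bare "-" produces no match.
        match = re.match(r"[A-Z]+\d+", hit.strip())
        if match:
            found.add(match.group(0))
    return found


def shared_families(*cells):
    """Families named by every tool that reported at least one hit."""
    reported = [fams for fams in (families(c) for c in cells) if fams]
    return set.intersection(*reported) if reported else set()
```

Calling shared_families on the HMMER, dbCAN_sub and DIAMOND cells of one row returns an empty set when the tools that reported something disagree at the family level, which makes such rows easy to flag for a manual look.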

Best, Aaron

yinlabniu commented 10 months ago

Thanks, Aaron, for providing all these excellent answers to Jigyasa. All your answers are correct, and I am happy that you really understand run_dbcan output despite our sloppy documentation in the readme. Just some additional information:

hmmer.out is already parsed with the best cazyme domain hits for each query protein.
overview.txt is the final cazyme annotation file. It is not for CGCs. Keeping those with >=2 tools is our recommendation. Not all cazymes are located in CGCs, so those not in CGCs but with support from >=2 tools are still highly likely cazymes. Even those without EC predictions are still good cazyme candidates.
Yes, dbsub.out can be used to extract predicted substrates for cazymes. I do not know why you didn't get any CGCs. One possible reason is that your query genome/contig is too fragmented and no CGCs are found.
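
The ">=2 tools" recommendation can be applied with a few lines of Python. This is only a sketch under assumptions: overview.txt is taken to be tab-delimited with a header row, and the tool-count column is located by a name ending in "ofTools" (it shows up as X.ofTools in the question above; the exact header text in the raw file is not stated in this thread). The function name is hypothetical.

```python
import csv


def confident_cazymes(lines, min_tools=2):
    """Return the overview.txt rows supported by at least `min_tools` tools.

    `lines` is any iterable of text lines, such as an open file handle.
    Assumes a tab-delimited file whose header has a column ending in
    "ofTools"; that column name is an assumption, not a documented fact.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    count_col = next(name for name in reader.fieldnames
                     if name.endswith("ofTools"))
    return [row for row in reader if int(row[count_col]) >= min_tools]


# Example usage (the path is a placeholder):
# with open("overview.txt", newline="") as handle:
#     kept = confident_cazymes(handle)
```

Rows without an EC prediction are deliberately not dropped here, in line with the point above that they can still be good CAZyme candidates.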

Yanbin

Jigyasa3 commented 10 months ago

Thank you so much for replying @yinlabniu and @AaronAOliver ,

I am working with very fragmented plasmid genomes, so maybe that explains the lack of CGCs in the output. I just wanted to make sure that the command I was using was correct: run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate. I really appreciate the detailed replies. I will definitely test on a positive control sample to make sure that my installation is correct.

Thank you!

Jigyasa3 commented 9 months ago

Hi @AaronAOliver and @yinlabniu ,

I took your advice and examined the installed software on a genome analyzed before (MGYG000002712). I am getting the cgc.out file and dbsub.out file. So the code works!

a) But I don't understand the columns in the cgc.out file and how they connect to the dbsub.out file. For example, the software finds that CGC1 contains MGYG000002712_77_9, the only protein with a substrate annotation. But this protein has multiple domains, GH5_e273 and CBM2_e118, which are annotated as degrading different substrates. Which one should be used?

b) What are the column names in the cgc.out file? Are columns 7 and 8 genomic positions? What does the column 11 annotation DB=gnl|TC-DB|Q48476|3.A.1.103.1;ID=MGYG000002712_5_16 DB=gnl|TC-DB|Q48476| mean?

c) Does the cgc.out file need to be filtered, or can I summarize the results from this file as it is?

I am attaching the output files for reference in case my understanding is wrong. dbsub.out.txt cgc.out.txt

Looking forward to your reply!

yinlabniu commented 9 months ago

Please see https://github.com/linnabrown/run_dbcan/issues/127 for the answer about dbsub.out.

The cgc.out format is explained at https://bcb.unl.edu/dbCAN_seq_old/help.php. But it is still hard to understand, which is why we made cgc_standard.out, a simplified version of cgc.out. The columns in cgc_standard.out are CGC_id, type, contig_id, gene_id, start, end, strand, annotation.
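
To summarize clusters from cgc_standard.out, one option is to group its rows by cluster. The Python sketch below assumes the file is tab-delimited with the eight columns in the order listed above; both the delimiter and the function name are assumptions for illustration. Any row whose start field is not numeric (such as a header line, if the file has one) is skipped, and clusters are keyed by contig plus CGC id in case CGC ids repeat across contigs.

```python
import csv
from collections import defaultdict

# Column order as listed in the comment above.
COLUMNS = ["CGC_id", "type", "contig_id", "gene_id",
           "start", "end", "strand", "annotation"]


def genes_by_cgc(lines):
    """Group cgc_standard.out rows by (contig_id, CGC_id).

    `lines` is any iterable of text lines, such as an open file handle.
    Assumes tab-delimited rows with the eight columns in COLUMNS order.
    """
    clusters = defaultdict(list)
    for fields in csv.reader(lines, delimiter="\t"):
        # Skip short rows and rows without numeric coordinates (e.g. a header).
        if len(fields) < len(COLUMNS) or not fields[4].isdigit():
            continue
        gene = dict(zip(COLUMNS, fields))
        gene["start"], gene["end"] = int(gene["start"]), int(gene["end"])
        clusters[(gene["contig_id"], gene["CGC_id"])].append(gene)
    return dict(clusters)


# Example usage (the path is a placeholder):
# with open("cgc_standard.out", newline="") as handle:
#     clusters = genes_by_cgc(handle)
```

Each value is the list of genes in one cluster, so counting CAZymes, transporters and other gene types per CGC becomes a matter of tallying the "type" field.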

linnabrown commented 8 months ago

Thank you very much! We have rewritten our README as Read the Docs documentation. Please send us any suggestions and comments on it. In addition, we have updated our tools with several additional functions.

https://dbcan.readthedocs.io/en/latest/index.html