Open Jigyasa3 opened 10 months ago
Hello Jigyasa,
I am not a developer, I only made a very small code contribution as a user, so I cannot answer your questions for sure. But here is my experience as someone who uses this program regularly and is familiar with the code:
The diamond.out file is generated using diamond against a protein database with the parameter -k 1, which means that diamond will only return a single target CAZyme annotation per gene. It also uses a low evalue, -e 1e-102, to keep only good hits. So, this file is not filtered after running diamond and only includes the best hit.
The hmmer.out file includes all valid hits based on HMMs. It seems like the overall best hit used for the final annotation is based on the HMM hit with the lowest evalue and highest coverage.
sub.prediction.out gives substrate prediction for CGCs. Based on your response to q5, the reason you don't get this file is because you didn't get any predicted CGCs.
overview.txt is a summary of individual protein annotations, while cgc.out summarizes annotated gene clusters. I personally tend to use the annotation associated with each gene in cgc.out as my personal "final" annotation but I don't know if that is the recommended usage.
I would say that the count of X.ofTools is less important than making sure all of the annotations agree. Substrate prediction for whole clusters tends to be better than for individual enzymes. I think the larger issue is not getting any CGCs out of the software. I would try downloading a genome with CGCs from this group's dbCAN_seq database (https://bcb.unl.edu/dbCAN_seq/) and running that through your installation to make sure everything is working properly.
Best, Aaron
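The ranking rule described above for HMM hits (lowest e-value first, then highest coverage) can be illustrated with a minimal Python sketch. This is not run_dbcan's own code: the record layout, the field names (gene_id, hmm, evalue, coverage), and the sample values are all hypothetical and only show the selection logic.

```python
# Hypothetical, simplified hit records. The field names are assumptions
# for illustration and are not the exact hmmer.out column headers.
hits = [
    {"gene_id": "gene_1", "hmm": "GH5", "evalue": 1e-50, "coverage": 0.95},
    {"gene_id": "gene_1", "hmm": "CBM2", "evalue": 1e-20, "coverage": 0.90},
    {"gene_id": "gene_2", "hmm": "GT2", "evalue": 1e-30, "coverage": 0.80},
    {"gene_id": "gene_2", "hmm": "GT4", "evalue": 1e-30, "coverage": 0.99},
]


def rank_key(rec):
    """Sort key: smaller e-value is better; on a tie, larger coverage is better."""
    return (rec["evalue"], -rec["coverage"])


def best_hit_per_gene(records):
    """Return a dict mapping each gene ID to its single best-ranked hit."""
    best = {}
    for rec in records:
        current = best.get(rec["gene_id"])
        if current is None or rank_key(rec) < rank_key(current):
            best[rec["gene_id"]] = rec
    return best


best = best_hit_per_gene(hits)
for gene_id, rec in best.items():
    print(gene_id, rec["hmm"], rec["evalue"], rec["coverage"])
```

In this sample, gene_2 has two hits with the same e-value, so the one with higher coverage is kept.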
Thanks, Aaron, for providing all these excellent answers to Jigyasa. All your answers are correct, and I am happy that you really understand run_dbcan output despite our sloppy documentation in the readme. Just some additional information:
hmmer.out is already parsed with the best cazyme domain hits for each query protein.
overview.txt is the final cazyme annotation file. It is not for CGCs. Keeping those with >=2 tools is our recommendation. Not all cazymes are located in CGCs, so those not in CGCs but with support from >=2 tools are still highly likely cazymes. Even those without EC predictions are still good cazyme candidates.
Yes, dbsub.out can be used to extract predicted substrates for cazymes. I do not know why you didn't get any CGCs. One possible reason is that your query genome/contig is too fragmented and no CGCs are found.
Yanbin
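The recommendation above to keep CAZymes supported by >=2 tools can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the column headers "Gene ID" and "#ofTools" (which R would display as Gene.ID and X.ofTools) are assumed, and the sample rows are made up, so check the header line of your own overview.txt before relying on it.

```python
import csv
import io

# Hypothetical sample standing in for overview.txt (tab-separated).
# Header names and all row values are assumptions for illustration.
SAMPLE = (
    "Gene ID\tEC#\tHMMER\tdbCAN_sub\tDIAMOND\t#ofTools\n"
    "gene_1\t-\tGH5(10-300)\tGH5_e273\tGH5\t3\n"
    "gene_2\t-\tGT2(5-200)\t-\t-\t1\n"
    "gene_3\t-\tGH1(3-450)\tGH1_e65\t-\t2\n"
)


def keep_supported(handle, min_tools=2, count_column="#ofTools"):
    """Return the rows whose tool count is at least min_tools."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if int(row[count_column]) >= min_tools]


supported = keep_supported(io.StringIO(SAMPLE))
supported_ids = [row["Gene ID"] for row in supported]
print(supported_ids)
```

For a real run, pass an open file handle (for example, open("overview.txt", newline="")) instead of the in-memory sample, and adjust count_column if your header differs.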
From: Aaron Oliver @.> Sent: Wednesday, December 6, 2023 5:12 PM To: linnabrown/run_dbcan @.> Cc: Yanbin Yin @.>; Mention @.> Subject: Re: [linnabrown/run_dbcan] output explanations for "diamond.out", "hmmer.out","dtemp.out","overview.txt" (Issue #138)
Thank you so much for replying, @yinlabniu and @AaronAOliver.
I am working with very fragmented plasmid genomes, which may explain the lack of CGCs in the output. I just wanted to make sure that the command I was using was correct:
run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate
I really appreciate the detailed replies. I will definitely test on a positive control sample to make sure that my installation is working correctly.
Thank you!
Hi @AaronAOliver and @yinlabniu,
I took your advice and ran the installed software on a previously analyzed genome (MGYG000002712). I am getting the cgc.out file and the dbsub.out file. So the code works!
a) But I don't understand the columns in the cgc.out file and how they connect to the dbsub.out file. For example, the software finds CGC1 to contain MGYG000002712_77_9, the only protein with a substrate annotation. But this protein has multiple domains, GH5_e273 and CBM2_e118, which are annotated as degrading different substrates. Which one should be used?
b) What are the column names in the cgc.out file? Are columns 7 and 8 genomic positions? What does the column 11 annotation DB=gnl|TC-DB|Q48476|3.A.1.103.1;ID=MGYG000002712_5_16 DB=gnl|TC-DB|Q48476| mean?
c) Does the cgc.out file need to be filtered, or can I summarize the results from this file as it is?
I am attaching the output files for reference in case my understanding is wrong: dbsub.out.txt, cgc.out.txt
Looking forward to your reply!
Please see https://github.com/linnabrown/run_dbcan/issues/127 for the answer regarding dbsub.out.
cgc.out is explained at https://bcb.unl.edu/dbCAN_seq_old/help.php. However, it is still hard to understand, which is why we made cgc_standard.out, a simplified version of cgc.out. The columns in cgc_standard.out are CGC_id, type, contig_id, gene_id, start, end, strand, annotation.
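As a rough illustration of how the eight cgc_standard.out columns listed above could be used, here is a minimal Python sketch that reads tab-separated rows and groups the genes by cluster. The sample rows are hypothetical, and three things are assumptions rather than documented behavior: that the file is tab-separated, that a header line may or may not be present (hence the has_header switch), and that CGC ids may repeat across contigs (hence the (contig_id, CGC_id) grouping key).

```python
import csv
import io
from collections import defaultdict

# Column order as listed in the reply above.
COLUMNS = ["CGC_id", "type", "contig_id", "gene_id",
           "start", "end", "strand", "annotation"]

# Hypothetical sample rows; none of these values come from a real run.
SAMPLE = (
    "CGC1\tCAZyme\tcontig_77\tgene_a\t100\t1300\t+\tGH5\n"
    "CGC1\tTC\tcontig_77\tgene_b\t1400\t2600\t+\t3.A.1.103.1\n"
    "CGC2\tCAZyme\tcontig_5\tgene_c\t50\t900\t-\tGT2\n"
)


def group_by_cgc(handle, has_header=False):
    """Group gene rows by (contig_id, CGC_id), converting positions to int."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    clusters = defaultdict(list)
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        row = dict(zip(COLUMNS, fields))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        clusters[(row["contig_id"], row["CGC_id"])].append(row)
    return clusters


clusters = group_by_cgc(io.StringIO(SAMPLE))
for (contig, cgc), genes in clusters.items():
    print(contig, cgc, [(g["gene_id"], g["type"]) for g in genes])
```

Grouping this way makes it easy to summarize each cluster (gene count, gene types, span) before looking up substrates for its CAZymes.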
Thank you very much! We have rewritten our README in Read the Docs format. Please send us any suggestions or comments on it. In addition, we have updated our tools with several additional functions.
Hi @linnabrown, @yinlabniu, @AaronAOliver
Thank you for a great resource! I am looking through the output of the run_dbcan script and was wondering if you could guide me in the right direction. My outputs are different from #127.
Script used:
run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate
Output:
1. diamond.out - There appears to be a single output per Gene.ID. Does that mean that the file is already pre-processed for best hit and E.value and we can directly examine it without any filtering of results? It looks similar to the blast.out file in #127.
2. hmmer.out - It has an HMM profile per Gene.ID, but there are some Gene.IDs with multiple hits. What's the difference between hmmer.out and diamond.out? Is hmmer.out the final file for Gene.ID annotation, or does it need to be filtered?
3. dbsub.out is also similar to #127, but the code I ran (as shown above) does not generate sub.prediction.out.
4. overview.txt is what is mentioned in the README.md file. Is this the final file to examine the CGCs and substrates?
5. In the overview.txt file, there are 6 columns (namely, EC., HMMER, dbCAN_sub, DIAMOND, Signalp, X.ofTools). How do I extract the best hit per Gene.ID? Can I say that if X.ofTools is more than 3, I can trust the annotation? Currently, I am filtering the overview.txt file to extract rows where EC. is not empty, and adding the substrate info from dbsub.out to the filtered overview.txt file. For example, one of the hits in overview.txt is GH1_e65. This hit maps to the beta-galactan substrate in the dbsub.out file. Is that the correct way to proceed? At the same time, I don't have the CGCs output. Why do you think that's happening?
Looking forward to your reply! Jigyasa
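The step described in question 5, adding substrate information from dbsub.out to the overview.txt annotations, can be sketched as a lookup by gene ID. The sketch below works on already-parsed rows and is only an illustration: the field names (gene_id, subfam, substrate) are hypothetical stand-ins for the real column headers, SUBFAM_X, SUBFAM_Y, substrate_x, and substrate_y are placeholders, and only the GH1_e65 to beta-galactan pairing comes from the example in the question. It keeps every substrate found for a gene instead of choosing one, and it does not filter on EC numbers.

```python
from collections import defaultdict

# Hypothetical parsed rows; field names are assumptions for illustration.
overview_rows = [
    {"gene_id": "gene_1", "n_tools": 3},
    {"gene_id": "gene_2", "n_tools": 2},
    {"gene_id": "gene_3", "n_tools": 2},
]
dbsub_rows = [
    {"gene_id": "gene_1", "subfam": "GH1_e65", "substrate": "beta-galactan"},
    {"gene_id": "gene_2", "subfam": "SUBFAM_X", "substrate": "substrate_x"},
    {"gene_id": "gene_2", "subfam": "SUBFAM_Y", "substrate": "substrate_y"},
    {"gene_id": "gene_3", "subfam": "SUBFAM_X", "substrate": "-"},
]


def attach_substrates(overview, dbsub, missing="-"):
    """Return copies of the overview rows with a list of substrates per gene."""
    by_gene = defaultdict(list)
    for hit in dbsub:
        substrate = hit["substrate"]
        # Skip placeholder values and avoid listing the same substrate twice.
        if substrate != missing and substrate not in by_gene[hit["gene_id"]]:
            by_gene[hit["gene_id"]].append(substrate)
    return [dict(row, substrates=by_gene.get(row["gene_id"], []))
            for row in overview]


annotated = attach_substrates(overview_rows, dbsub_rows)
for row in annotated:
    print(row["gene_id"], row["n_tools"], row["substrates"])
```

Keeping a list per gene makes multi-domain proteins visible, so cases where different domains point to different substrates can be reviewed by hand.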