Is it possible for dbCAN to output the protein sequence of CAZyme hits?

namsca commented 3 years ago

Hi,

I have run dbCAN on all of the bacterial genomes for one specific genus and would like to make a reference database of all the CAZymes in this genus to use as a reference for short metagenomic reads. My intention is to run diamond blastx against this database as was suggested by you in an answer to one of the other issues, but I am wondering if there is an easy way to output these files?

I see that uniInput has the translated genes, and overview.txt has the position of the CAZyme, but I'm wondering if it is possible for me to slightly edit the code such that it also produces one more file, a protein fasta file which contains the name of the CAZyme and the sequence, based on the position given by HMMscan if it meets some criteria such as being predicted by all three tools.

Thanks in advance for your answer and for making such an excellent tool!

Kind regards, Nick

yinlabniu commented 3 years ago

Yes, you certainly can write a simple script in python or other languages to extract the dbCAN domains found by hmmscan. You will need the input.faa and overview.txt files.

Yanbin

From: namsca notifications@github.com Sent: Tuesday, January 26, 2021 4:53 PM To: linnabrown/run_dbcan Cc: Subscribed Subject: [linnabrown/run_dbcan] Is it possible for dbCAN to output the protein sequence of CAZyme hits? (#64)

Non-NU Email

Hi,

I have run dbCAN on all of the bacterial genomes for one specific genus and would like to make a reference database of all the CAZymes in this genus to use as a reference for short metagenomic reads. My intention is to run diamond blastx against this database as was suggested by you in an answer to one of the other issues, but I am wondering if there is an easy way to output these files?

I see that uniInput has the translated genes, and overview.txt has the position of the CAZyme, but I'm wondering if it is possible for me to slightly edit the code such that it also produces one more file, a protein fasta file which contains the name of the CAZyme and the sequence, based on the position given by HMMscan if it meets some criteria such as being predicted by all three tools.

Thanks in advance for your answer and for making such an excellent tool!

Kind regards, Nick

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_linnabrown_run-5Fdbcan_issues_64&d=DwMCaQ&c=Cu5g146wZdoqVuKpTNsYHeFX_rg6kWhlkLF8Eft-wwo&r=f65eEPN7tgPSqkv5z4zNJA&m=KsiWMt3vCgqJRJtinPBFIcAArBzsquD4oPCBu5ggnTI&s=40K0mPPYlNiijbnpla3IqmHPMn0tminzpKcCDowimh4&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AEXNKZXANB7X2CCQUTNILPTS35BYFANCNFSM4WUH4DNQ&d=DwMCaQ&c=Cu5g146wZdoqVuKpTNsYHeFX_rg6kWhlkLF8Eft-wwo&r=f65eEPN7tgPSqkv5z4zNJA&m=KsiWMt3vCgqJRJtinPBFIcAArBzsquD4oPCBu5ggnTI&s=9FvXXuxg9YTRma5igmfcaRZLhWZg6gtt15lsorxMqOM&e=.

namsca commented 3 years ago

Thanks for getting back to me so quickly! I am a python beginner so I'm sure there's a better way to do this but in case anyone saw this and was having the same problem, I made a newline separated list (names.txt) of the first column in overview.txt for rows which had a prediction that used all three tools, and then did the following in python:

from Bio import SeqIO

my_list = []
with open("names.txt") as names:
    for line in names:
        my_list.extend(line.split(' '))

fin = open('uniInput', 'r')
fout = open('CAZy_reference.fasta', 'w')

for record in SeqIO.parse(fin,'fasta'):
    for item in my_list:
        if item.strip() == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(str(record.seq + "\n"))

fin.close()
fout.close()

linnabrown / run_dbcan

Is it possible for dbCAN to output the protein sequence of CAZyme hits? #64