katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens
Other
123 stars 65 forks source link

Error in step: Using the VFDB Virulence Factor Database with SRST2 #59

Closed schultzm closed 8 years ago

schultzm commented 8 years ago

Error in step: Using the VFDB Virulence Factor Database with SRST2

1: The URL to the VFDB has changed and, using wget, should now be:

wget http://www.mgc.ac.cn/VFs/Down/VFDB_setB_nt.fas.gz

2: We need three extra steps for the database parsing scripts to work.

i) gunzip VFDB_setB_nt.fas.gz ii) mv VFDB_setB_nt.fas VFDB_setB_nt.fas.ffn iii) save the following code as convert.py and execute as python convert.py, which gets rid of the "(gi:xxxxx)" bit in the sequence headers and the commas (replacing with '_') because downstream processing splits on the commas:

with open("VFDB_setB_nt.fas.ffn", "r") as input_handle:
    with open("VFDB_setB_nt.fas_corrected.ffn", "w") as output_handle:
        for line in input_handle:
            if '(gi:' in line:
                line=line.replace('(gi:', ' (gi:').replace(',','_')
                line = line.split()
                output_handle.write(line[0]+' '+' '.join(line[2:])+'\n')
            else:
                output_handle.write(line)

Without removing the text "(gi:xxxxx)" from the header, this next step will not work:

python VFDB_cdhit_to_csv.py --cluster_file Clostridium_cdhit90.clstr --infile Clostridium.fsa --outfile Clostridium_cdhit90.csv

Also, there is a redundant comment on line 46 in the script VFDB_cdhit_to_csv.py (should read "VFGxxxxxx"):

schultzm@x:gene_dbs $ grep 'the unique ID R0xxx' ~/VFDB_cdhit_to_csv.py -n
46:         database[ClusterNr].append(seqID) # for virulence gene DB, this is the unique ID R0xxx
katholt commented 8 years ago

Would be great if you could just put your edits into the original script and push the changes.

On 22 March 2016 at 3:36:04 pm, Mark Schultz (notifications@github.commailto:notifications@github.com) wrote:

Actually, it's a bit more than the above. Need to remove the whole "(gi:xxxxxxxx)" bit. Here's how I did it using python:

with open("VFDB_setB_nt.fas.ffn", "r") as input_handle: with open("VFDB_setB_nt.fas.ffn_edit", "w") as output_handle: for line in input_handle: if ' (gi:' in line: line = line.split() output_handle.write(line[0]+' '+' '.join(line[2:])+'\n') else: output_handle.write(line)

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHubhttps://github.com/katholt/srst2/issues/59#issuecomment-199634498

rrwick commented 8 years ago

Fixed in https://github.com/katholt/srst2/commit/5b1639be854e77f2375a1e8b7d09fae6ba5cf653 - thanks!