katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

UnboundLocalError for VFDB_cdhit_to_csv.py #52

Closed sarahpenir closed 8 years ago

sarahpenir commented 8 years ago

Good day,

When I ran VFDB_cdhit_to_csv.py against my cluster file, I got the following error:

Traceback (most recent call last):
  File "../database_clustering/VFDB_cdhit_to_csv.py", line 67, in <module>
    sys.exit(main())
  File "../database_clustering/VFDB_cdhit_to_csv.py", line 61, in main
    outstring = ",".join([seqID, clusterid, gene, allele, str(record.seq), re.sub(",","",record.description)]) + "\n"
UnboundLocalError: local variable 'clusterid' referenced before assignment

What could have caused the error?

Thank you very much, Sarah

aphayt commented 8 years ago

Hi Sarah

Last night I hit the same problem, with an identical error message, while trying to set up a virulence gene database for Salmonella. Have you found a solution to the issue?

Many thanks, Yue

sarahpenir commented 8 years ago

Hi @aphayt,

I was able to make the program work by modifying the "main" function of VFDB_cdhit_to_csv.py with the following code:

def main():

    args = parse_args()
    outfile = open(args.outfile, "w")  # open() works in Python 2 and 3; file() is Python 2 only
    outfile.write("seqID,clusterid,gene,allele,DNA,annotation\n")

    database = {} # key = clusterid, value = list of seqIDs
    seq2cluster = {} # key = seqID, value = clusterid

    for line in open(args.cluster_file):
        if line.startswith(">"):
            ClusterNr = line.split()[1]
            continue

        line_split = line.split(">")
        seqID = line_split[1].split("(")[0]

        if ClusterNr not in database:
            database[ClusterNr] = []
        if seqID not in database[ClusterNr]:
            database[ClusterNr].append(seqID) # for virulence gene DB, this is the unique ID R0xxx
        seq2cluster[seqID] = ClusterNr
    for record in SeqIO.parse(open(args.infile, "r"), "fasta"):
        clusterid = ""  # default, so the name is bound even when seqID has no cluster entry
        full_name = record.description
        genus = full_name.split("[")[2].split()[0]
        id_bits = re.sub("[()]","",full_name.split("[")[0]).split() # 'R004852 fliL VP2243 '
        seqID = full_name.split()[0].split("(")[0] # R004852
        gene = id_bits[1] # fliL

        if len(id_bits) > 2:
            allele = id_bits[1]+"_"+id_bits[2] # fliL_VP2243
        else:
            allele = id_bits[1]
        if seqID in seq2cluster:
            clusterid = seq2cluster[seqID]
        outstring = ",".join([seqID, clusterid, gene, allele, str(record.seq), re.sub(",","",record.description)]) + "\n"
        outfile.write(outstring)
    outfile.close()

Hope this helps, Sarah P.
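
For readers who want to see why the default assignment matters, here is a minimal, self-contained sketch of the failure mode and the fix. It assumes the original script bound `clusterid` only inside the `if seqID in seq2cluster:` branch, which is what the traceback suggests but is not confirmed by the thread. The function names and the sample cluster lines are hypothetical and only mimic the CD-HIT `.clstr` layout that the code above parses.

```python
def parse_clusters(lines):
    """Map each sequence ID to its cluster number from CD-HIT .clstr-style lines."""
    seq2cluster = {}
    cluster_nr = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line, e.g. ">Cluster 0"
            cluster_nr = line.split()[1]
            continue
        # Member line, e.g. "0\t1500nt, >R004852(gb|AAA)... *"
        seq_id = line.split(">")[1].split("(")[0]
        seq2cluster[seq_id] = cluster_nr
    return seq2cluster


def cluster_for_buggy(seq_id, seq2cluster):
    # Assumed shape of the original bug: clusterid is bound only when the
    # lookup succeeds, so a missing seq_id raises UnboundLocalError below.
    if seq_id in seq2cluster:
        clusterid = seq2cluster[seq_id]
    return clusterid


def cluster_for_fixed(seq_id, seq2cluster):
    # Same lookup, but with a default so the name is always bound.
    clusterid = ""
    if seq_id in seq2cluster:
        clusterid = seq2cluster[seq_id]
    return clusterid


if __name__ == "__main__":
    # Hypothetical sample data in .clstr-like form.
    sample = [
        ">Cluster 0",
        "0\t1500nt, >R004852(gb|AAA)... *",
        ">Cluster 1",
        "0\t900nt, >R000001(gb|CCC)... *",
    ]
    mapping = parse_clusters(sample)
    print(cluster_for_fixed("R004852", mapping))
    print(repr(cluster_for_fixed("R999999", mapping)))
    try:
        cluster_for_buggy("R999999", mapping)
    except UnboundLocalError as err:
        print("buggy version failed:", err)
```

If this assumption holds, the error appears whenever a sequence ID in the FASTA file has no matching entry in the cluster file, for example when the two files were produced from different inputs or the IDs were truncated differently. The default keeps the script running, but rows with an empty clusterid are worth checking by hand.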

aphayt commented 8 years ago

Hi Sarah

Many thanks for sharing. I have now built two VF databases, Campylobacter and Salmonella, by following the steps in 'Error in step: Using the VFDB Virulence Factor Database with SRST2' #59.

Best, Yue

rrwick commented 8 years ago

Fixed in https://github.com/katholt/srst2/commit/5b1639be854e77f2375a1e8b7d09fae6ba5cf653 - thanks!