katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

srst2.py throws error if allele has slash character #29

Closed ppcherng closed 9 years ago

ppcherng commented 9 years ago

I ran srst2 against a gene database in which some of the allele names contain slash characters:

70aec27/clpVaec27/clpV_EC042_0215__R033079 R033079 aec27/clpV (EC042_0215) - putative type VI secretion system protein [Escherichia coli str. 042 (EAEC O44:H18)]

ATGATCCAGATTGATTTAGCCACGCTGGTAAAGCGGCTTAACCCCTTTGCAAAACAGGCG ...

This causes srst2.py to crash when generating pileups:

Traceback (most recent call last):
  File "/usr/local/bin/srst2", line 9, in <module>
    load_entry_point('srst2==0.1.5', 'console_scripts', 'srst2')()
  File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1548, in main
    db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
  File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1102, in run_srst2
    db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
  File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1164, in process_fasta_db
    unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
  File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1275, in map_fileSet_to_db
    unique_gene_symbols, unique_allele_symbols, pileup_file)
  File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 859, in parse_scores
    allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
  File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 765, in create_allele_pileup
    with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '67aec27/clpVaec27/clpV_G2583_0230R033067.ERR024627ERR024627.Escherichia_VF_clustered.pileup'
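The failure looks consistent with the allele name being used as part of the output file name, so that everything before a slash is treated as a directory that does not exist. A minimal sketch of that behaviour, using a made-up allele name and a temporary directory (neither is taken from srst2 itself):

```python
import os
import shutil
import tempfile

# Hypothetical allele name containing a slash, modelled on the header above.
allele_name = "aec27/clpV"
out_name = allele_name + ".pileup"

# The text before the slash is interpreted as a directory component.
implied_dir = os.path.dirname(out_name)

workdir = tempfile.mkdtemp()
failed = False
error_number = None
try:
    # The directory "aec27" does not exist inside workdir, so this fails.
    with open(os.path.join(workdir, out_name), "w") as handle:
        handle.write("")
except IOError as err:  # IOError is an alias of OSError on Python 3
    failed = True
    error_number = err.errno
finally:
    shutil.rmtree(workdir)

print(implied_dir, failed, error_number)
```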

A possible solution is to just change the slashes to underscores.

katholt commented 9 years ago

For now: Just replace slashes in your fasta file before using.
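One way to do that replacement is sketched below. The function names and the choice of underscore as the replacement are assumptions for illustration, not part of srst2. Only header lines (those starting with ">") are changed, and it is worth checking afterwards that the renamed entries are still unique.

```python
def replace_slashes_in_headers(lines, replacement="_"):
    """Yield FASTA lines, replacing '/' in header lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line.replace("/", replacement)
        else:
            # Sequence lines are passed through unchanged.
            yield line


def rewrite_fasta(in_path, out_path, replacement="_"):
    """Write a copy of in_path to out_path with slash-free headers."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in replace_slashes_in_headers(src, replacement):
            dst.write(line)
```

For example, `rewrite_fasta("genes.fasta", "genes.noslash.fasta")` would write a cleaned copy that can be passed to srst2 in place of the original (both file names are placeholders).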

Permanent fix: Add this as a checkpoint when parsing the gene database. End the run and report to the user that gene names can't contain slashes and that they need to fix this before running. This is preferable to SRST2 trying to replace characters within the outputs, as that could lead to compatibility problems for the user later on.
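A rough sketch of what such a check could look like follows. It is not the code that ended up in srst2; the function names are hypothetical, and it assumes that only the sequence identifier (the header text up to the first whitespace) ends up in file names.

```python
import sys


def find_headers_with_slash(lines):
    """Return (line_number, sequence_id) for each header whose id contains '/'."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue
        fields = line[1:].split(None, 1)
        # Assumption: the id is the first whitespace-delimited token.
        seq_id = fields[0] if fields else ""
        if "/" in seq_id:
            bad.append((line_number, seq_id))
    return bad


def check_gene_db(path):
    """Stop the run with a message if any sequence id contains a slash."""
    with open(path) as handle:
        bad = find_headers_with_slash(handle)
    if bad:
        for line_number, seq_id in bad:
            sys.stderr.write("line %d: %s\n" % (line_number, seq_id))
        sys.exit("Gene names must not contain '/'. "
                 "Please fix the database and rerun.")
```

Calling `check_gene_db` before any mapping starts would make the run fail early with a readable message instead of an IOError partway through.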