katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

VFDB clusterid naming convention problem, KeyError: '51__map__map_ATCC' #67

Closed toddknutson closed 8 years ago

toddknutson commented 8 years ago

Hi,

I cloned your GitHub code on 6/20/16, which contained the latest fixes for creating SRST2-compatible files using the updated VFDB downloads. However, when following the directions on your database_clustering/readme.md page, my final VF_clustered.fasta file contained an incorrectly formatted [cluster]__[gene]__[allele]__[VFDB] name. If this file is used with SRST2, the program breaks with a Python dictionary error: KeyError: '51__map__map_ATCC', indicating this key does not exist.

I found that my final VF_clustered.fasta file contained a sequence with header:

>51__map__map_ATCC 25904__VFG001799 VFG001799(gi:8648965) (map) extracellular proteins Map [Eap/Map (VF0016)] [Staphylococcus aureus str. Newman D2C (ATCC 25904)]

This header is formatted incorrectly: there should not be a space between ATCC and 25904. As a workaround, I manually deleted the space and re-ran SRST2, which corrected the problem. However, I think one of the scripts used to parse the headers made an error in this case. I have not investigated which script or which part of the code contains the bug, but I wanted to let you know. Thanks for a great tool!! The full error traceback is below:

Traceback (most recent call last):
  File "/panfs/roc/groups/10/marthal4/knut0297/software/srst2/bin/srst2", line 9, in <module>
    load_entry_point('srst2==0.1.8', 'console_scripts', 'srst2')()
  File "/home/marthal4/knut0297/software/srst2/lib/python2.7/site-packages/srst2/srst2.py", line 1711, in main
    db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
  File "/home/marthal4/knut0297/software/srst2/lib/python2.7/site-packages/srst2/srst2.py", line 1248, in run_srst2
    db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
  File "/home/marthal4/knut0297/software/srst2/lib/python2.7/site-packages/srst2/srst2.py", line 1310, in process_fasta_db
    unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
  File "/home/marthal4/knut0297/software/srst2/lib/python2.7/site-packages/srst2/srst2.py", line 1471, in map_fileSet_to_db
    column_header = cluster_symbols[cluster_id]
KeyError: '51__map__map_ATCC'
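
For anyone hitting the same KeyError, here is a minimal Python 3 sketch (not part of SRST2) for spotting headers like the one above before running the analysis. It assumes the sequence ID is the text up to the first whitespace in the header, which matches the truncated key in the traceback, and that a valid ID has four non-empty fields separated by `__`. The function name `find_bad_headers` is hypothetical, and the `ATGC` sequence lines are placeholders rather than real VFDB data.

```python
EXPECTED_FIELDS = 4  # assumed layout: [cluster]__[gene]__[allele]__[VFDB ID]


def find_bad_headers(fasta_lines):
    """Return (line_number, seq_id) pairs whose ID lacks four '__'-separated fields."""
    bad = []
    for line_number, line in enumerate(fasta_lines, start=1):
        if not line.startswith(">"):
            continue  # skip sequence lines
        parts = line[1:].split()
        # Assumption: the ID ends at the first whitespace in the header.
        seq_id = parts[0] if parts else ""
        fields = seq_id.split("__")
        if len(fields) != EXPECTED_FIELDS or not all(fields):
            bad.append((line_number, seq_id))
    return bad


# The header from this issue, followed by the same header with the space removed.
example = [
    ">51__map__map_ATCC 25904__VFG001799 VFG001799(gi:8648965) (map) extracellular proteins Map",
    "ATGC",  # placeholder sequence
    ">51__map__map_ATCC25904__VFG001799 VFG001799(gi:8648965) (map) extracellular proteins Map",
    "ATGC",  # placeholder sequence
]

problems = find_bad_headers(example)
for line_number, seq_id in problems:
    print("line {}: {}".format(line_number, seq_id))
```

To check a real file, pass an open file handle instead of the list, for example `find_bad_headers(open("VF_clustered.fasta"))`.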

rrwick commented 8 years ago

Todd,

Thanks for spotting that one. The VFDB changed their FASTA file format, and my new parsing approach was grabbing the wrong allele name for that one. The allele name shouldn't be map_ATCC 25904 but rather map_VF0016.

I've fixed VFDB_cdhit_to_csv.py again to deal with all gene/allele names properly (I hope!), so if you pull from the SRST2 master branch, it should work properly now! There's no need to rerun your analysis (your space-removal trick is fine), but be aware that the allele name will change if you generate the SRST2 database again.

Ryan

toddknutson commented 8 years ago

Hi Ryan,

Great, thanks for the update. And thanks for SRST2, it has allowed us to make a very interesting discovery that would not have been possible without your software!

Todd
