katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

VFDB_cdhit_to_csv.py throws KeyError for Mycobacterium #25

Closed ppcherng closed 9 years ago

ppcherng commented 9 years ago

I am following the directions from https://github.com/katholt/srst2#using-the-vfbd-virulence-factor-database-with-srst2 to generate a VFDB gene database for Mycobacterium. I have already run these commands:

    python VFDBgenus.py --infile CP_VFs.ffn
    cd-hit -i Mycobacterium.fsa -o Mycobacterium_cdhit90 -c 0.9 > Mycobacterium_cdhit90.stdout

(FYI, I am using cd-hit-v4.6.1-2012-08-27)

But when I run this command:

    python VFDB_cdhit_to_csv.py --cluster_file Mycobacterium_cdhit90.clstr --infile Mycobacterium.fsa --outfile Mycobacterium_cdhit90.csv

I get this error:

    Traceback (most recent call last):
      File "/install/srst2/database_clustering/VFDB_cdhit_to_csv.py", line 66, in <module>
        sys.exit(main())
      File "/install/srst2/database_clustering/VFDB_cdhit_to_csv.py", line 59, in main
        clusterid = seq2cluster[seqID]
    KeyError: 'R027152'

ppcherng commented 9 years ago

I investigated further, and it appears that R027152 in Mycobacterium.fsa is an empty FASTA entry: a header line with no sequence after it.

    R027152 mps1 (Rv0101) - Probable peptide synthetase Nrp (peptide synthase) [Mycobacterium tuberculosis str. H37Rv]
    R005431 fbpC (Rv0129c) - SECRETED ANTIGEN 85-C FBPC (85C) (ANTIGEN 85 COMPLEX C) (AG58C) (MYCOLYL TRANSFERASE 85C) (FIBRONECTIN-BINDING PROTEIN C) [Mycobacterium tuberculosis str. H37Rv]
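Rather than checking entries by hand, the empty records can be listed with a short script. This is a minimal sketch, not part of srst2: the function name `find_empty_fasta_ids` and the example records are hypothetical, and it assumes the sequence ID is the first whitespace-delimited token after `>` on each header line.

```python
def find_empty_fasta_ids(lines):
    """Return the IDs of FASTA records whose header has no sequence lines."""
    empty_ids = []
    current_id = None
    has_sequence = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record; report it if it was empty.
            if current_id is not None and not has_sequence:
                empty_ids.append(current_id)
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            has_sequence = False
        elif line and current_id is not None:
            has_sequence = True
    # The last record has no following header, so check it separately.
    if current_id is not None and not has_sequence:
        empty_ids.append(current_id)
    return empty_ids


# Hypothetical records: the first header has no sequence, the second does.
# The bases are placeholders, not real VFDB data.
example = [
    ">R027152 mps1 (Rv0101)",
    ">R005431 fbpC (Rv0129c)",
    "ATGACGTT",
]
print(find_empty_fasta_ids(example))
```

To scan the real file, an open file handle can be passed in place of the list, since the function only iterates over lines.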

I'm not sure what the best solution is here. Perhaps just skip seqIDs that are not keys in seq2cluster? Something like this, placed above line 59 in VFDB_cdhit_to_csv.py:

    if seqID not in seq2cluster:
        continue

ppcherng commented 9 years ago

I put in this workaround fix that also prints warning messages:

    if seqID not in seq2cluster:
        print "Warning: %s not found in seq2cluster dictionary" % seqID
        continue
    clusterid = seq2cluster[seqID]

After rerunning the VFDB_cdhit_to_csv.py command with this fix, I actually got quite a few warning messages:

    Warning: R027152 not found in seq2cluster dictionary
    Warning: R028506 not found in seq2cluster dictionary
    Warning: R028767 not found in seq2cluster dictionary
    Warning: R028662 not found in seq2cluster dictionary
    Warning: R028713 not found in seq2cluster dictionary
    Warning: R027232 not found in seq2cluster dictionary
    Warning: R026954 not found in seq2cluster dictionary
    Warning: R029114 not found in seq2cluster dictionary
    Warning: R026854 not found in seq2cluster dictionary
    Warning: R029140 not found in seq2cluster dictionary
    Warning: R029193 not found in seq2cluster dictionary
    Warning: R029062 not found in seq2cluster dictionary
    Warning: R028895 not found in seq2cluster dictionary
    Warning: R028861 not found in seq2cluster dictionary
    Warning: R027310 not found in seq2cluster dictionary
    Warning: R027359 not found in seq2cluster dictionary
    Warning: R027458 not found in seq2cluster dictionary
    Warning: R027407 not found in seq2cluster dictionary
    Warning: R027980 not found in seq2cluster dictionary
    Warning: R028033 not found in seq2cluster dictionary
    Warning: R028087 not found in seq2cluster dictionary
    Warning: R028142 not found in seq2cluster dictionary
    Warning: R027523 not found in seq2cluster dictionary
    Warning: R027099 not found in seq2cluster dictionary
    Warning: R027041 not found in seq2cluster dictionary
    Warning: R029035 not found in seq2cluster dictionary
    Warning: R028287 not found in seq2cluster dictionary
    Warning: R028288 not found in seq2cluster dictionary
    Warning: R028556 not found in seq2cluster dictionary
    Warning: R028609 not found in seq2cluster dictionary
    Warning: R027619 not found in seq2cluster dictionary
    Warning: R028982 not found in seq2cluster dictionary
    Warning: R028928 not found in seq2cluster dictionary
    Warning: R027894 not found in seq2cluster dictionary
    Warning: R027923 not found in seq2cluster dictionary
    Warning: R027951 not found in seq2cluster dictionary
    Warning: R027674 not found in seq2cluster dictionary
    Warning: R027701 not found in seq2cluster dictionary
    Warning: R027729 not found in seq2cluster dictionary
    Warning: R027757 not found in seq2cluster dictionary
    Warning: R027796 not found in seq2cluster dictionary
    Warning: R027827 not found in seq2cluster dictionary
    Warning: R027859 not found in seq2cluster dictionary
    Warning: R027578 not found in seq2cluster dictionary
    Warning: R028819 not found in seq2cluster dictionary
    Warning: R028454 not found in seq2cluster dictionary
    Warning: R028403 not found in seq2cluster dictionary
    Warning: R028351 not found in seq2cluster dictionary
    Warning: R028225 not found in seq2cluster dictionary
    Warning: R028196 not found in seq2cluster dictionary
    Warning: R028261 not found in seq2cluster dictionary
    Warning: R029248 not found in seq2cluster dictionary
    Warning: R029279 not found in seq2cluster dictionary
    Warning: R029311 not found in seq2cluster dictionary
    Warning: R029348 not found in seq2cluster dictionary

I spot-checked several of these seqIDs in the Mycobacterium.fsa file, and they all appear to be empty FASTA entries with no sequence.
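An alternative to skipping the missing keys would be to drop the empty entries from the FASTA file before clustering, so that the FASTA file and the .clstr file list the same IDs. The following is only a sketch of that idea, assuming the warnings all come from sequence-less records as the spot check suggests; `drop_empty_records` and the example data are hypothetical and not part of srst2.

```python
def drop_empty_records(lines):
    """Yield (header, sequence) pairs, skipping headers with no sequence."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Emit the previous record only if it collected some sequence.
            if header is not None and chunks:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    # Emit the final record, again only if it is non-empty.
    if header is not None and chunks:
        yield header, "".join(chunks)


# Hypothetical input: one empty record followed by one two-line record.
# The bases are placeholders, not real VFDB data.
records = [
    ">R027152 mps1 (Rv0101)",
    ">R005431 fbpC (Rv0129c)",
    "ATGACG",
    "TTGA",
]
cleaned = list(drop_empty_records(records))
for header, sequence in cleaned:
    print(header)
    print(sequence)
```

Writing the surviving records to a new file and running cd-hit and VFDB_cdhit_to_csv.py on that file should avoid the KeyError without modifying the script, though this has not been tested against the real database.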

katholt commented 9 years ago

Fix posted, thanks ppcherng