katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

Error with gene database made from VFDB: vertical bars in headers #75

Closed amilesj closed 7 years ago

amilesj commented 7 years ago

Hello,

I ran into issues when trying to use SRST2 with a database made from VFDB. I followed the instructions you have posted for using VFDB and ended up with a fasta file of genes from the Escherichia genus that looked good to my (novice) eye. However, when I ran SRST2 using this gene database the SRST2 script got stuck after logging "Printing verbose gene detection results..." in the log, and printed this to the terminal:

sh: NP_752600_VF0228__VFG000923: command not found

NP_752600 is a gene name in the gene database. I pinpointed the issue to be in the grep command in line 1502 of the source code. Some of the gene names in my VFDB database had vertical bars in the header, which caused problems with grep. Here is an example of a problematic header in the clustered fasta file that I used as my gene_db:

>135__gb|NP_752600__gb|NP_752600_VF0228__VFG000923 VFG000923(gb|NP_752600) (fepA) ferrienterobactin outer membrane transporter [Enterobactin (VF0228)] [Escherichia coli CFT073]
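
The symptoms above are consistent with a gene name being placed unquoted into a shell command. The shell treats `|` as a pipe, so `grep gb` runs with no file argument and waits on standard input, while the rest of the name is executed as a command, which would produce the "command not found" message. Below is a minimal Python sketch of that failure mode and one way to avoid it. It is not SRST2's actual code; `build_grep_command` and `results.txt` are hypothetical names used only for illustration.

```python
import shlex


def build_grep_command(pattern, path):
    """Return a shell command string with both arguments safely quoted.

    Hypothetical helper for illustration; not taken from SRST2.
    """
    return "grep {} {}".format(shlex.quote(pattern), shlex.quote(path))


gene = "gb|NP_752600_VF0228__VFG000923"

# Unquoted: a shell would parse this as `grep gb`, piped into a command
# named NP_752600_VF0228__VFG000923 with results.txt as its argument.
unsafe = "grep " + gene + " results.txt"

# Quoted: the vertical bar is passed to grep as a literal character.
safe = build_grep_command(gene, "results.txt")

print(unsafe)
print(safe)  # grep 'gb|NP_752600_VF0228__VFG000923' results.txt
```

Another option is to pass the arguments as a list to `subprocess.run` without `shell=True`, which bypasses the shell entirely.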

I used sed to remove the "gb|" prefixes from my gene_db, and that fixed the issue. I wasn't sure whether this is something you'd want to incorporate into the code you provide for preparing an SRST2-compatible database from VFDB, or whether you'd rather add a warning to the instructions to look out for this.
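
For anyone who prefers Python to sed, the same cleanup can be sketched as follows. The exact sed command used above is not given in this thread, so this is an assumed equivalent: it removes every "gb|" from header lines only and leaves sequence lines untouched. The function name `strip_gb_prefix` and the sample records are hypothetical.

```python
def strip_gb_prefix(lines):
    """Yield FASTA lines with every 'gb|' removed from header lines.

    Hypothetical helper for illustration; sequence lines pass through as-is.
    """
    for line in lines:
        if line.startswith(">"):
            line = line.replace("gb|", "")
        yield line


# Illustrative records only; the sequence is a made-up placeholder.
records = [
    ">135__gb|NP_752600__gb|NP_752600_VF0228__VFG000923 VFG000923(gb|NP_752600) (fepA)\n",
    "ATGAACAAGAAG\n",
]

cleaned = list(strip_gb_prefix(records))
print(cleaned[0], end="")
# >135__NP_752600__NP_752600_VF0228__VFG000923 VFG000923(NP_752600) (fepA)
```

To process a real file, the same generator can be fed an open file handle and its output written to a new file, so the original database is kept as a backup.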

Best, Arianna

rrwick commented 7 years ago

Arianna,

Thanks for letting us know! Parsing the VFDB file has always been tricky because the header lines aren't very consistent. I think I've fixed the issue with a change I just pushed up to the SRST2 master branch, so if you clone/pull SRST2 and follow these instructions again, I think it should work.

Now when it encounters a FASTA header like this:

>VFG000739(gb|AAC38392) (eae) intimin [Intimin (VF0177)]

it will call that gene eae instead of gb|AAC38392.
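
As a rough illustration of that behaviour, the sketch below pulls the gene name out of the second pair of parentheses in a VFDB-style header and falls back to the accession when no gene name is present. This is an assumption about how such parsing could be done, not the code that was pushed to SRST2; `HEADER_RE` and `gene_name` are hypothetical names.

```python
import re

# Matches: optional '>', sequence ID, '(accession)', then an optional '(gene)'.
HEADER_RE = re.compile(
    r"^>?(?P<seq_id>[^\s(]+)\((?P<accession>[^)]*)\)\s*(?:\((?P<gene>[^)]+)\))?"
)


def gene_name(header):
    """Return the gene name from a VFDB-style header.

    Falls back to the accession if the header has no '(gene)' part,
    and returns None if the header does not match the expected layout.
    """
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("gene") or match.group("accession")


print(gene_name(">VFG000739(gb|AAC38392) (eae) intimin [Intimin (VF0177)]"))  # eae
print(gene_name(">VFG000739(gb|AAC38392) intimin"))  # gb|AAC38392
```

Because VFDB headers are not very consistent, a real parser would likely need more fallbacks than this single pattern provides.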

But your manual removal of the gb| parts should also work, so no need to repeat anything if your current results are good.

Thanks again!