Error with gene database made from VFDB: vertical bars in headers

Hello,

I ran into an issue when trying to use SRST2 with a database made from VFDB. I followed the instructions you have posted for using VFDB and ended up with a fasta file of genes from the Escherichia genus that looked good to my (novice) eye. However, when I ran SRST2 with this gene database, the script got stuck after logging "Printing verbose gene detection results..." and printed this to the terminal:

sh: NP_752600_VF0228__VFG000923: command not found

NP_752600 is a gene name in the gene database. I traced the issue to the grep command on line 1502 of the source code. Some of the gene names in my VFDB database had vertical bars in the header, and the shell appears to treat each bar as a pipe, so the text after a bar gets run as a command instead of being passed to grep. Here is an example of a problematic header in the clustered fasta file that I used as my gene_db:

>135__gb|NP_752600__gb|NP_752600_VF0228__VFG000923 VFG000923(gb|NP_752600) (fepA) ferrienterobactin outer membrane transporter [Enterobactin (VF0228)] [Escherichia coli CFT073]
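
To illustrate the failure mode, here is a minimal sketch. It assumes the gene name is pasted unquoted into a shell command string; I have not reproduced the actual code from line 1502, and `build_grep_command` and `genes.fasta` are hypothetical names used only for illustration. Quoting the argument keeps the vertical bar inside the grep pattern.

```python
import shlex  # shlex.quote needs Python 3.3+; Python 2 has pipes.quote instead


def build_grep_command(pattern, path):
    """Return a grep command string with both arguments shell-quoted.

    Hypothetical helper for illustration, not taken from the SRST2 source.
    """
    return "grep {} {}".format(shlex.quote(pattern), shlex.quote(path))


# Fragment of a header ID containing a vertical bar.
gene_id = "gb|NP_752600_VF0228__VFG000923"

# Unquoted: a shell would split this at the bar and try to run
# NP_752600_VF0228__VFG000923 as a separate command.
unsafe = "grep " + gene_id + " genes.fasta"

# Quoted: the bar stays part of the pattern handed to grep.
safe = build_grep_command(gene_id, "genes.fasta")

print(unsafe)
print(safe)
```

Passing the arguments as a list to `subprocess` without `shell=True` would avoid the shell entirely, which is another way to sidestep the problem.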

I used sed to remove the "gb|" prefixes from my gene_db, and that fixed the issue. I was not sure whether this is something you would want to incorporate into the code you provide for preparing an SRST2-compatible database from VFDB, or whether the instructions could include a warning to look out for it.
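
For reference, the same cleanup can be sketched in Python. This is only an approximation of what I did with sed, not code from the SRST2 repository; `strip_gb_prefix` is a hypothetical name, and it assumes that "gb|" is the only source of vertical bars in the headers.

```python
def strip_gb_prefix(fasta_lines):
    """Yield FASTA lines with every "gb|" removed from header lines.

    Sequence lines are passed through unchanged. Assumes "gb|" is the only
    place a vertical bar appears in the headers.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            line = line.replace("gb|", "")
        yield line


# Small in-memory example; a real run would iterate over the open fasta file.
example = [
    ">135__gb|NP_752600 VFG000923(gb|NP_752600) (fepA)\n",
    "ATGAACAAG\n",
]
cleaned = list(strip_gb_prefix(example))
print("".join(cleaned), end="")
```

Checking afterwards that no header still contains "|" would catch prefixes other than "gb|".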

Best, Arianna
