Closed amilesj closed 7 years ago
Arianna,
Thanks for letting us know! Parsing the VFDB file has always been tricky because the header lines aren't very consistent. I've just pushed a change to the SRST2 master branch that I think fixes the issue, so if you clone or pull SRST2 and follow these instructions again, it should work.
Now when it encounters a FASTA header like this:
>VFG000739(gb|AAC38392) (eae) intimin [Intimin (VF0177)]
it will call that gene eae instead of gb|AAC38392.
But your manual removal of the gb| parts should also work, so no need to repeat anything if your current results are good.
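To illustrate the naming rule described above, here is a minimal Python sketch. It is not the code that was pushed to SRST2: the function name `gene_name_from_vfdb_header` and the regular expression are assumptions made for illustration. It looks for a parenthesized token that follows whitespace, such as (eae), and falls back to the VFG identifier when no such token is present.

```python
import re

# Hypothetical pattern (an assumption, not SRST2's own): a parenthesized,
# whitespace-free token that is preceded by whitespace, e.g. " (eae)".
# The accession "(gb|AAC38392)" is attached directly to the VFG ID, so it is skipped.
_GENE_SYMBOL = re.compile(r"(?<=\s)\(([^()\s]+)\)")


def gene_name_from_vfdb_header(header):
    """Return a gene name for one VFDB FASTA header line (illustrative only)."""
    text = header.lstrip(">").strip()
    # Ignore the trailing "[...]" description, which has its own parentheses.
    text = text.split("[", 1)[0]
    match = _GENE_SYMBOL.search(text)
    if match:
        return match.group(1)
    # Fallback: use the leading identifier (e.g. "VFG000739").
    fallback = text.split("(", 1)[0].split()
    return fallback[0] if fallback else ""


if __name__ == "__main__":
    header = ">VFG000739(gb|AAC38392) (eae) intimin [Intimin (VF0177)]"
    print(gene_name_from_vfdb_header(header))
```

Since VFDB headers are not very consistent, a rule like this would likely need checking against the actual file before relying on it.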
Thanks again!
Hello,
I ran into issues when trying to use SRST2 with a database made from VFDB. I followed the instructions you have posted for using VFDB and ended up with a FASTA file of genes from the Escherichia genus that looked good to my (novice) eye. However, when I ran SRST2 using this gene database, the script got stuck after logging "Printing verbose gene detection results..." and printed this to the terminal:
NP_752600 is a gene name in the gene database. I pinpointed the issue to the grep command on line 1502 of the source code. Some of the gene names in my VFDB database had vertical bars in the header, which caused problems with grep. Here is an example of a problematic header in the clustered FASTA file that I used as my gene_db:
I used sed to remove the "gb|" prefixes from my gene_db, and that fixed the issue. I was not sure whether this is something you would want to incorporate into the code you provide for preparing an SRST2-compatible database from VFDB, or whether a warning in the instructions to look out for this would be enough.
Best,
Arianna
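For anyone who wants to apply the same workaround without sed, the cleanup described above could be sketched in Python roughly as follows. The exact sed command used is not shown in this thread, so this is an assumption about its effect: it removes the literal text "gb|" from header lines only and leaves sequence lines untouched. The function name `strip_gb_prefix` is hypothetical.

```python
def strip_gb_prefix(lines):
    """Yield FASTA lines with the literal "gb|" removed from header lines.

    Illustrative sketch only: the assumption is that "gb|" is the only part
    of the header containing a vertical bar that needs to be removed.
    """
    for line in lines:
        if line.startswith(">"):
            # Headers only; sequence lines are passed through unchanged.
            line = line.replace("gb|", "")
        yield line


if __name__ == "__main__":
    example = [
        ">VFG000739(gb|AAC38392) (eae) intimin [Intimin (VF0177)]\n",
        "ATGATTACTCATGGTTGT\n",  # made-up sequence, for illustration only
    ]
    for cleaned in strip_gb_prefix(example):
        print(cleaned, end="")
```

To process a real database, the same generator can be fed an open file handle and its output written to a new file, so the original gene_db stays untouched.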