Enhancements in vfdb_parser.py for VFDB full dataset support

Currently, after downloading the full VFDB dataset with the getref vfdb_full (...) command, it is not possible to run ariba prepareref (...) on the resulting reference data without manual changes to both the .fa and the .tsv files. This is because the reference dataset contains several pitfalls that are not yet addressed:

Duplicate sequence IDs, which raise errors in ariba prepareref.
Sequences with stop codons, which are filtered out. (The metadata.tsv file created by vfdb_parser currently declares every sequence in the dataset as a gene.)
Gene symbols that include brackets or blank spaces, so the intended naming does not work for every sequence, which complicates the creation of meaningful cluster names.
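
For illustration, the three pitfalls above could be handled roughly as sketched below. This is a minimal sketch under stated assumptions, not the code in this pull request: the helper names (make_unique, sanitize_symbol, has_internal_stop), the numeric-suffix scheme, and the set of characters replaced are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Standard stop codons (assumption: sequences are DNA, read in frame 0).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def make_unique(seq_id, seen):
    """Return seq_id unchanged on first use, else with a numeric suffix.

    `seen` is a Counter shared across all calls for one dataset.
    The ".N" suffix scheme is a hypothetical choice for this sketch.
    """
    seen[seq_id] += 1
    count = seen[seq_id]
    return seq_id if count == 1 else f"{seq_id}.{count}"


def sanitize_symbol(symbol):
    """Replace runs of brackets, slashes and whitespace with one underscore."""
    cleaned = re.sub(r"[\s()\[\]{}/]+", "_", symbol.strip())
    return cleaned.strip("_")


def has_internal_stop(seq):
    """True if an in-frame stop codon occurs before the final codon."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


if __name__ == "__main__":
    seen = Counter()
    print(make_unique("VFG000001", seen))
    print(make_unique("VFG000001", seen))
    print(sanitize_symbol("hly (A)"))
    print(has_internal_stop("ATGTAAGGGTGA"))
```

A check like has_internal_stop could be used to decide whether a sequence is declared as a gene or as non-coding in metadata.tsv instead of letting it be filtered out. Whether that is the right handling for a given sequence is a judgment call this sketch does not make.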

The modifications proposed here address all of the shortcomings mentioned above. Furthermore, the xls-derived metadata from VFDB describing the function and mechanism of the corresponding virulence gene (VFs.xls.gz, see the VFs description file on the VFDB download page) is included in the metadata.tsv produced by vfdb_parser, giving a more comprehensive view of the ariba variant calling results when working with VFDB.
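
As a rough illustration of how such descriptive metadata could be attached to a sequence's free-text description, consider the sketch below. It assumes the VFs table has already been loaded into a dictionary keyed by a VF identifier. The function name describe, the column names "Function" and "Mechanism", and the example identifier are assumptions made for this example and may not match the actual VFs.xls.gz layout or the code in this pull request.

```python
def describe(vf_id, vf_table, original=""):
    """Append function/mechanism text for vf_id to an existing description.

    vf_table maps a VF identifier to a dict of column name -> text.
    The column names used here are hypothetical.
    """
    row = vf_table.get(vf_id, {})
    parts = [original]
    for key in ("Function", "Mechanism"):
        value = row.get(key, "")
        if value:
            parts.append(f"{key}: {value}")
    text = " ".join(part for part in parts if part)
    # Collapse tabs and newlines so the text stays within one TSV field.
    return " ".join(text.split())


if __name__ == "__main__":
    table = {"VF0001": {"Function": "Adhesion",
                        "Mechanism": "Binds\thost cells"}}
    print(describe("VF0001", table, "hlyA"))
```

Collapsing whitespace matters because metadata.tsv is tab-separated, so a stray tab or line break in the source text would otherwise shift or split columns.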

Thank you for considering this for merging into a future release.

sanger-pathogens / ariba

Enhancements in vfdb_parser.py for VFDB full dataset support #320