Summary of the Seeker and ProphET work proposed in this pull request:
Seeker
Things we've done:
The default Seeker interface doesn't support selecting the pre-trained model type (we suspect it uses the 'metagenome' model by default, but we need the prophage model) or specifying an output file.
To overcome this limitation, we wrote a new command-line interface based on argparse. It runs Seeker as a Python package and wraps everything into one command. This is the run_seeker.py script located in the scripts folder. The script uses the prophage model by default (--lstm prophage).
We added a YAML file that can be used to set up a conda environment for Seeker.
We prepared a straightforward Snakemake file, based on phageboost.smk.
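A minimal sketch of the kind of argparse interface described above, for readers who have not opened run_seeker.py. Only the --lstm option and its prophage default come from this description; the positional fasta argument, the --output option and its default value are hypothetical names chosen for illustration, and the call into Seeker itself is deliberately left out.

```python
import argparse


def build_parser():
    """Build a parser resembling the interface described for run_seeker.py."""
    parser = argparse.ArgumentParser(
        description="Run Seeker on a FASTA file (illustrative sketch)."
    )
    # Hypothetical argument: the real script may name its input differently.
    parser.add_argument("fasta", help="input FASTA file")
    # Hypothetical option: an explicit output file, which the default
    # Seeker interface does not offer.
    parser.add_argument(
        "--output",
        default="seeker_predictions.tsv",
        help="file to write predictions to",
    )
    # From the description above: the prophage model is the default.
    parser.add_argument(
        "--lstm",
        choices=["prophage", "metagenome"],
        default="prophage",
        help="pre-trained model type to use",
    )
    return parser
```

With this sketch, build_parser().parse_args(["genome.fasta"]) selects the prophage model without any extra flags, and passing --lstm metagenome switches models.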
Issues:
No issues were detected.
ProphET
Things we've done:
We simplified the ProphET installation using conda. All dependencies (including the Perl ones) can now be installed automatically, without superuser privileges. The environment for the tool is specified in prophet.yaml, located in the conda_environments folder.
ProphET requires a lot of input pre-processing.
We have written a set of rules specific to ProphET in prophET_rules.smk:
Sequence names must not contain non-alphanumeric characters other than underscores (dots in particular)
We decided to correct this at the GenBank level using the sanitize_genbank.py script (located in the scripts folder). The script creates temporary .gbk files with changed accession numbers: any dot in an accession is replaced with the string _dot_. This step is defined in the sanitize_accesions rule. To stay consistent with the names in the original .gbk files, we revert these changes just before creating the .tsv file in the prophet_2_tbl rule.
The tool requires a specifically structured GFF file
It was surprisingly difficult to convert GenBank files to .gff files compatible with the tool. We found that bp_genbank2gff3.pl from BioPerl produces the most correct .gff files. Unfortunately, the script doesn't recognize compressed GenBank files, so an additional unpacking step is required (unpack_gbk rule). The file generated by the script must then be standardised to ProphET's needs. This is done using gff_rewrite.pl, a script provided by ProphET's authors (get_prophet_gff rule). To stay consistent with the file system convention and to allow file conversion before installation, we moved the whole folder with the GFF tools library, ProphET/UTILS/GFFLib, to the scripts folder.
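The accession round trip described above (dots replaced with _dot_ before ProphET runs, then restored before the .tsv file is written) can be sketched as follows. This is an illustration under assumptions, not the code of sanitize_genbank.py: the function names are hypothetical, and the sketch works on bare accession strings rather than on .gbk files.

```python
# Placeholder string taken from the description above.
DOT_TOKEN = "_dot_"


def sanitize_accession(accession: str) -> str:
    """Replace every dot so the name contains only characters ProphET accepts."""
    return accession.replace(".", DOT_TOKEN)


def restore_accession(sanitized: str) -> str:
    """Undo sanitize_accession so results match the original .gbk names."""
    return sanitized.replace(DOT_TOKEN, ".")
```

The round trip is lossless only when the original accession does not already contain the literal string _dot_; the sketch does not guard against that case.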
Issues:
Re-analyzing files that were previously analyzed into the same output folder might cause a crash (it happened only once and we don't know why).
The Perl package Bio/Graphics can't be installed using conda, so ProphET doesn't generate images showing the locations of potential prophage sites in the genome. It also prints a warning during analysis. This doesn't affect predictions in any other way, so the warning can be safely ignored.