The WSPC package can be installed using one of the following options:
pip install wspc
conda install -c zivukelsongroup wspc
Dependencies:
On Windows: make sure that the Python "Scripts\" directory is added to PATH, so that the package can be run as a command.
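As a quick check, the following sketch reports whether the scripts directory of the running interpreter is listed in PATH. It relies only on the standard sysconfig and os modules; the helper name scripts_dir_on_path is illustrative and not part of WSPC. For installs made with pip's --user option, the scripts directory may differ from the one reported here.

```python
import os
import sysconfig


def scripts_dir_on_path(path_value=None, scripts_dir=None):
    """Return True if the Python scripts directory is listed in PATH.

    Both arguments are optional overrides; by default the function inspects
    the current interpreter and the current environment.
    """
    if scripts_dir is None:
        # "Scripts" on Windows, "bin" on most other platforms.
        scripts_dir = sysconfig.get_path("scripts")
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def norm(p):
        # Normalize case and separators so equivalent spellings compare equal.
        return os.path.normcase(os.path.normpath(p))

    target = norm(scripts_dir)
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(norm(e) == target for e in entries)


if __name__ == "__main__":
    print("Scripts directory:", sysconfig.get_path("scripts"))
    print("Listed in PATH:", scripts_dir_on_path())
```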
usage: wspc [-h] [-m {predict,fit}] -i I [-o OUTPUT] [-l LABELS_PATH] [--model_path MODEL_PATH] [-k K] [-t T]
optional arguments:
-h, --help show this help message and exit
-m {predict,fit}, --mode {predict,fit}
-i I input directory with genome *.txt files or a merged input *.fasta file
-o OUTPUT, --output OUTPUT
output directory, default current directory
-l LABELS_PATH, --labels_path LABELS_PATH
path to *.csv file with labels
--model_path MODEL_PATH
path to a saved model in a *.pkl file. If not provided, saved pre-trained model will be used
-k K parameter for training - selecting k-best features using chi2
-t T parameter for training - clustering threshold
You can predict the pathogenicity potential of a group of genomes using a saved model in a *.pkl file. If a path is not provided, the saved pre-trained model will be used. The WSPC pre-trained model can be found at https://github.com/shakedna1/wspc_rep/blob/main/src/wspc/model/WSPC_model.pkl.
wspc -m predict -i path_to_input_genomes
Train a new model using the fit command.
You can train a new model using the same k (selecting k-best features using chi2) and t (clustering threshold) values as WSPC (450 and 0.18, respectively), or using different values of your choice.
wspc -m fit -i path_to_input_genomes -l path_to_labels -k 450 -t 0.18
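To illustrate what "selecting k-best features using chi2" means, here is a minimal, standard-library sketch that scores each protein family by a chi-square statistic computed from presence counts per class and keeps the k highest-scoring ones. This is an illustration of the general technique only, not the WSPC implementation; the function names chi2_scores and select_k_best and the toy data are hypothetical. The clustering step controlled by t is not covered here.

```python
from collections import Counter


def chi2_scores(genomes, labels):
    """Chi-square score of each feature against the class labels.

    genomes: list of sets of feature IDs (presence/absence per genome).
    labels:  list of class labels, one per genome.
    """
    n = len(genomes)
    class_counts = Counter(labels)
    feature_totals = Counter()
    observed = {c: Counter() for c in class_counts}
    for feats, label in zip(genomes, labels):
        feature_totals.update(feats)
        observed[label].update(feats)

    scores = {}
    for feat, total in feature_totals.items():
        score = 0.0
        for c, c_count in class_counts.items():
            # Expected count if the feature were independent of the class.
            expected = total * c_count / n
            score += (observed[c][feat] - expected) ** 2 / expected
        scores[feat] = score
    return scores


def select_k_best(genomes, labels, k):
    """Return the k feature IDs with the highest chi-square scores."""
    scores = chi2_scores(genomes, labels)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: four genomes, two per class.
    toy_genomes = [{"A", "B"}, {"A"}, {"B", "C"}, {"C"}]
    toy_labels = [1, 1, 0, 0]
    print(select_k_best(toy_genomes, toy_labels, k=2))
```

In the toy data, families A and C each occur in only one class and therefore score higher than B, which occurs once in each class.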
Download and extract the WSPC dataset (WSPC train set & WSPC test set) from https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
On Ubuntu:
wget https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
unzip train_test_datasets.zip
Train:
wspc -m fit -i train_genomes.fasta -l train_genomes_info.csv -k 450 -t 0.18
The file trained_model.pkl will be saved in the current directory (or in the directory provided through the -o argument).
Test:
wspc -m predict -i test_genomes.fasta --model_path trained_model.pkl
The file predictions.csv will contain the predictions.
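The train and test commands above can also be driven from a Python script. The sketch below only assembles the documented command-line flags and passes them to subprocess; the helper names build_fit_command, build_predict_command, and run are hypothetical, and running the commands assumes the wspc executable is available on PATH.

```python
import subprocess


def build_fit_command(genomes, labels, k=450, t=0.18, output=None):
    """Assemble the `wspc -m fit` command as an argument list."""
    cmd = ["wspc", "-m", "fit", "-i", str(genomes), "-l", str(labels),
           "-k", str(k), "-t", str(t)]
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def build_predict_command(genomes, model_path=None, output=None):
    """Assemble the `wspc -m predict` command as an argument list."""
    cmd = ["wspc", "-m", "predict", "-i", str(genomes)]
    if model_path is not None:
        cmd += ["--model_path", str(model_path)]
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def run(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run(...) to execute.
    print(build_fit_command("train_genomes.fasta", "train_genomes_info.csv"))
    print(build_predict_command("test_genomes.fasta",
                                model_path="trained_model.pkl"))
```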
Below are detailed running examples of WSPC as a Python module:
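import wspc
# Train a new model on labeled genomes
X_train = wspc.read_genomes(path_to_genomes)
y = wspc.read_labels(path_to_labels, X_train)
model = wspc.fit(X_train, y, k=450, t=0.18)
# Predict with the trained model (path_to_genomes here points to the genomes to classify)
X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)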
import wspc
# model_path: path to a saved model in a *.pkl file. If not provided, the saved pre-trained model will be used
model = wspc.load_model(model_path)
X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)
WSPC handles different types of input:
Input directory with genome .tab and/or .txt files:
*.tab file - Public genomes in the PATRIC database are available through a genomes directory. Each genome directory includes a .features.tab file, which provides all genomic features and related information in tab-delimited format, including PGFams information. For an example features.tab file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/Bacpacs/patric_files/1041522.28.PATRIC.features.tab
*.txt file - Output file of the PATRIC annotation service for a new genome. For more details on the file and the annotation service, see the section "Obtain PATRIC Global Protein Families (PGFams) annotations for new sequenced genome" below.
Merged input .fasta file: A merged file in fasta format that contains a concatenation of the PGFams information, which can be extracted from a .tab file using the field "pgfam_id" and from a *.txt file using the field "pgfam".
For the exact format of the merged file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/train_genomes.fasta
Example of the fasta file content:
>1346.123
PGF_10048015
PGF_00062045
PGF_00409415
PGF_00766022
PGF_02011026
X
X
X
PGF_07480521
PGF_01162199
PGF_03475877
PGF_00876106
PGF_06473395
PGF_06429692
PGF_00007012
PGF_04788810
Note that any protein family annotation IDs can be used, e.g., COGs, eggNOGs etc.
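For inspecting a merged file outside of WSPC, a small reader along the following lines may help. It is a sketch that assumes each genome starts with a ">" header line holding the genome ID and that the following lines hold whitespace-separated family IDs; the function name parse_merged_fasta is hypothetical and this is not the parser WSPC itself uses.

```python
def parse_merged_fasta(lines):
    """Map each genome ID to the list of family IDs that follow its header.

    lines: any iterable of text lines, for example an open file handle.
    """
    genomes = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: start (or continue) the record for this genome ID.
            current = line[1:].strip()
            genomes.setdefault(current, [])
        elif current is None:
            raise ValueError("family ID found before any '>' header line")
        else:
            # Accept one ID per line as well as several IDs on one line.
            genomes[current].extend(line.split())
    return genomes


if __name__ == "__main__":
    example = [">1346.123", "PGF_10048015", "PGF_00062045", "PGF_00409415"]
    for genome_id, families in parse_merged_fasta(example).items():
        print(genome_id, len(families), "families")
```

To read a file from disk, pass an open handle, e.g. `with open("train_genomes.fasta") as fh: genomes = parse_merged_fasta(fh)`.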
PATRIC provides a Global Protein Families (PGFams) annotation service for new genomes. To generate a PGFams annotation file for a newly sequenced genome:
Use PATRIC's Genome Annotation Service: https://patricbrc.org/app/Annotation.
For detailed instructions, see the PATRIC genome annotation service documentation: https://docs.patricbrc.org/user_guides/services/genome_annotation_service.html
Download the resulting "Taxonomy name + label".txt file (click on view, then download. "Taxonomy name + label" is the genome name).
If you wish to create a merged *.fasta file for a number of genomes, use the column "pgfam" to extract the PGFam IDs.
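A minimal sketch of such a merge is shown below. It makes several assumptions that should be checked against your own files: the annotation files are tab-delimited with a header row, the file name without its extension serves as the genome ID, and the output holds one family ID per line under each ">" header (compare the result with train_genomes.fasta). The function names are hypothetical. For .features.tab files, pass column="pgfam_id" instead.

```python
import csv
from pathlib import Path


def pgfams_from_annotation(path, column="pgfam"):
    """Return the non-empty values of `column` from a tab-delimited file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"column {column!r} not found in {path}")
        # Features without a family assignment have an empty value; skip them.
        return [row[column] for row in reader if row.get(column)]


def write_merged_fasta(annotation_paths, out_path, column="pgfam"):
    """Write one '>' record per annotation file to a merged fasta file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in annotation_paths:
            path = Path(path)
            # Assumption: the file stem is used as the genome identifier.
            out.write(f">{path.stem}\n")
            for family in pgfams_from_annotation(path, column):
                out.write(family + "\n")


if __name__ == "__main__":
    # Merge every *.txt annotation file found in the current directory.
    inputs = sorted(Path(".").glob("*.txt"))
    if inputs:
        write_merged_fasta(inputs, "merged_genomes.fasta")
```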