These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
You need to install the following software before replicating this framework on your local machine or server.
Java JDK
Python version 3.5+
Anaconda version 3+
Retrieve the code
git clone https://github.com/manisa/ClassifyTE.git
cd ClassifyTE
Create and activate the virtual environment with the Python dependencies.
conda env create -f environment.yml python==3.7
conda activate ClassifyTE_env
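Before moving on, it can help to confirm that the prerequisites listed above are visible from the activated environment. The following is a minimal sketch, not part of ClassifyTE: the function name `check_prerequisites` is hypothetical, and it only checks that `java` and `conda` executables are on the PATH and that the running Python is at least 3.5.

```python
import shutil
import sys

MIN_PYTHON = (3, 5)  # minimum Python version named in the prerequisites


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        "python>=3.5": tuple(version_info[:2]) >= MIN_PYTHON,
        "java": which("java") is not None,    # Java JDK on PATH
        "conda": which("conda") is not None,  # Anaconda on PATH
    }


# Print a short report for the current environment.
for name, ok in check_prerequisites().items():
    print("{:<12} {}".format(name, "found" if ok else "MISSING"))
```

A `MISSING` entry only means the executable was not found on the PATH; it does not verify the installed version of Java or Anaconda.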
ClassifyTE/
models/
ClassifyTE_combined.pkl
ClassifyTE_repbase.pkl
ClassifyTE_pgsb.pkl
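The layout above can be verified with a few lines of Python before running a prediction. This is a sketch under the assumption that it is run from the ClassifyTE root directory; the helper name `missing_models` is hypothetical and not part of the project.

```python
from pathlib import Path

# Model files expected inside the models directory, as listed above.
EXPECTED_MODELS = [
    "ClassifyTE_combined.pkl",
    "ClassifyTE_repbase.pkl",
    "ClassifyTE_pgsb.pkl",
]


def missing_models(root="."):
    """Return the expected model files that are absent from <root>/models."""
    models_dir = Path(root) / "models"
    return [name for name in EXPECTED_MODELS if not (models_dir / name).is_file()]


missing = missing_models()
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All model files are in place.")
```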
To run the program on the test TE sequence:
ClassifyTE/
data/
demo.fasta
python generate_feature_file.py -f demo.fasta -d demo_features -o demo_features.csv
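For generate_feature_file.py, the user provides the fasta filename from the data directory (-f), the feature folder name (-d), and the resulting feature file name with .csv extension (-o).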
python evaluate.py -f demo_features.csv -n node.txt -d demo_features -m ClassifyTE_combined.pkl -a lcpnb
For evaluate.py, the user has to provide the following parameters:
-f for the feature file name, which is feature_file.csv by default unless the user provided a different feature filename in the earlier step.
-d for the feature folder name, which is features by default unless the user provided a different feature folder name while generating features.
-n for the node filename, which is node.txt by default. The node file contains a numerical representation of the taxonomy of the dataset. Please check the nodes folder for the node files of the other datasets.
-m for the model filename, which has .pkl as its file extension. All the model files must already be in the models directory.
-a for algorithm choice (lcpnb or nllcpn)
Finally, check the files inside the output folder for the predicted labels of the TE sequence(s).
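The two demo steps can also be assembled from a small script, which makes it harder to pass mismatched feature file or folder names between them. The sketch below only builds and prints the two commands shown above; the function name `build_demo_commands` is hypothetical, and the defaults simply repeat the demo values from this section.

```python
import shlex


def build_demo_commands(fasta="demo.fasta", feature_dir="demo_features",
                        feature_file="demo_features.csv", node_file="node.txt",
                        model="ClassifyTE_combined.pkl", algorithm="lcpnb"):
    """Return the feature-generation and evaluation commands as argument lists."""
    if algorithm not in ("lcpnb", "nllcpn"):
        raise ValueError("algorithm must be 'lcpnb' or 'nllcpn'")
    generate = ["python", "generate_feature_file.py",
                "-f", fasta, "-d", feature_dir, "-o", feature_file]
    # The same feature file and folder names are reused for evaluation.
    evaluate = ["python", "evaluate.py",
                "-f", feature_file, "-n", node_file, "-d", feature_dir,
                "-m", model, "-a", algorithm]
    return [generate, evaluate]


# Dry run: print the commands instead of executing them.
for command in build_demo_commands():
    print(" ".join(shlex.quote(part) for part in command))
```

To execute the steps rather than print them, each argument list can be passed to `subprocess.run(command, check=True)` from the ClassifyTE root directory.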
To run the program on a new TE sequence:
ClassifyTE/
data/
[your_fasta_file]
python generate_feature_file.py -f your_fasta_file_name -o your_feature_file_name -d your_feature_directory
For generate_feature_file.py, the user can provide the following parameters:
-f for the fasta filename from the data directory.
-d for the feature folder name. Use a different folder name for each fasta file if you want to generate features for multiple fasta files.
-o for the resulting feature file name with .csv extension [Optional] [By default the feature filename is feature_file.csv.]
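Since the feature generator takes a fasta file as input, a quick sanity check of a new file can catch formatting problems early. The reader below is a minimal, generic FASTA parser written for illustration; it is not the parser used by generate_feature_file.py, and the sequence names in the example are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, only to show the expected shape of the input.
example = [">TE_1 hypothetical", "ACGTAC", "GTTA", ">TE_2", "NNACGT"]
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

To check a real file, open it from the data directory and pass the file handle to `read_fasta` in place of `example`.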
Then run the following Python command from the root directory to get the prediction on the new TE sequences. Before running the command, please make sure that all the model files have already been added to the models directory.
python evaluate.py -f your_feature_file_name -d your_feature_directory -n node_file -m model_name
For evaluate.py, the user has to provide the following parameters:
-f for the feature file name, which is feature_file.csv by default unless the user provided a different feature filename in the earlier step.
-n for the node filename, which is node.txt by default. The node file contains a numerical representation of the taxonomy of the dataset. Each node file is associated with the model trained on the corresponding dataset. See the nodes section below for details.
-m for the model filename, which has .pkl as its file extension. All the model files must already be in the models directory.
Finally, check the predicted_result.csv file inside the output folder for the predicted labels of the TE sequence(s).
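The result file can be inspected from Python as well. The sketch below uses only the standard csv module and makes no assumption about the column layout of predicted_result.csv, which is not documented here; it prints each non-empty row as a list. The helper name `read_rows` is hypothetical.

```python
import csv
from pathlib import Path


def read_rows(handle):
    """Return all non-empty CSV rows from an open file or iterable of lines."""
    return [row for row in csv.reader(handle) if row]


result_path = Path("output") / "predicted_result.csv"
if result_path.is_file():
    with result_path.open(newline="") as handle:
        for row in read_rows(handle):
            print(row)
else:
    print("No predicted_result.csv found; run evaluate.py first.")
```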
ClassifyTE/
data/
pgsb_feature_file.csv
repbase_feature_file.csv
combined.csv
python train.py -f csv_file_name -n txt_node_file -m model_filename -c SVM_cost_parameter -g SVM_gamma_parameter
For train.py, the user has to provide the following parameters:
-f for feature file name.
-m for model file name
-n for node filename. Node file consists of numerical representation of taxonomy of the dataset.
-c for cost parameter of SVM with RBF kernel
-g for gamma parameter of SVM with RBF kernel
We have optimized the cost and gamma parameters of the SVM with RBF kernel for all three datasets. The optimal values differ between datasets, so you have to pass the hyper-parameters that match the dataset you are training on.
For PGSB dataset : C=32, gamma=0.03125
For REPBASE dataset : C=128.0, gamma=0.0078125
For combined dataset : C=512.0, gamma=0.0078125
Under the nodes directory you will find three files, one for each of the corresponding datasets. Each node file lists all the nodes in its dataset as a numerical representation of the dataset's taxonomy.
If you would like to train the model on your machine, a training example looks like this:
python train.py -f combined.csv -n node.txt -m ClassifyTE_combined -c 512.0 -g 0.0078125
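Because each dataset needs its own cost and gamma values, keeping them in one table avoids pairing a feature file with the wrong hyper-parameters. The sketch below builds the train.py command for a chosen dataset. The feature file names and hyper-parameters are taken from this section; the model names for the PGSB and REPBASE datasets are assumptions made by analogy with the files in the models directory, and the node file must be supplied by the caller because the per-dataset node file names are not listed here.

```python
# Feature file, model name, and optimized SVM hyper-parameters per dataset.
# Model names for "pgsb" and "repbase" are assumed, not confirmed by the docs.
TRAINING_SETUPS = {
    "pgsb": {"features": "pgsb_feature_file.csv", "model": "ClassifyTE_pgsb",
             "C": 32.0, "gamma": 0.03125},
    "repbase": {"features": "repbase_feature_file.csv", "model": "ClassifyTE_repbase",
                "C": 128.0, "gamma": 0.0078125},
    "combined": {"features": "combined.csv", "model": "ClassifyTE_combined",
                 "C": 512.0, "gamma": 0.0078125},
}


def build_train_command(dataset, node_file):
    """Return the train.py argument list for one of the three datasets."""
    setup = TRAINING_SETUPS[dataset]
    return ["python", "train.py",
            "-f", setup["features"], "-n", node_file, "-m", setup["model"],
            "-c", str(setup["C"]), "-g", str(setup["gamma"])]


# Reproduces the combined-dataset example shown above.
print(" ".join(build_train_command("combined", "node.txt")))
```

For the other two datasets, pass the matching node file from the nodes directory as the second argument.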
Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah
This project is licensed under the MIT License - see the LICENSE.md file for details
[1] Nakano, F.K., et al. Top-down Strategies for Hierarchical Classification of Transposable Elements with Neural Networks. In, IEEE. 2017.