h1. Automated Assignment of Human Readable Descriptions (AHRD)
Short descriptions in sequence databases are useful to quickly gain insight into important information about a sequence, for example in search results. We developed a new program called “Automatic assignment of Human Readable Descriptions” (AHRD) with the aim to select descriptions and Gene Ontology terms that are concise, informative and precise. AHRD outperforms competing methods and can overcome problems caused by wrong annotations, lack of similar sequences and partial alignments.
h2. Table of contents
h2. 1 Getting started
h3. 1.1 Requirements
AHRD is a Java-Program which requires @Java 1.7@ or higher and @ant@.
h3. 1.2 Installation
h4. 1.2.1 Get AHRD
Copy (clone) AHRD to your computer using git via command-line, then change into AHRD's directory, and finally use the latest stable version:
git clone https://github.com/groupschoof/AHRD.git cd AHRD git checkout tags/v3.3.3
Alternativelly without using @git@, you can download AHRD version @v3.3.3@ ("zip":https://github.com/groupschoof/AHRD/archive/v3.3.3.zip or "tar.gz":https://github.com/groupschoof/AHRD/archive/v3.3.3.tar.gz) and extract it.
h4. 1.2.2 Build the executable jar
Running
ant dist
will create the executable JAR-File: @./dist/ahrd.jar@
h2. 2 Usage
All AHRD-Inputs are passed to AHRD in a single YML-File. See @./ahrd_example_input.yml@ for details. (About YAML-Format see Wikipedia/YAML)
Basically AHRD needs a FASTA-File of amino acid sequences and different files containing the results from the respective BLAST searches, in our example we searched three databases: Uniprot/trEMBL, Uniprot/Swissprot and TAIR10. Note, that AHRD is generic and can make use of any number of different Blast databases that do not necessarily have to be the above ones. If e.g. annotating genes from a fungal genome searching yeast databases might be more recommendable than using TAIR (Arabidopsis thaliana).
All parameters can be set manually, or the default ones can be used as given in the example input file @ahrd_example_input.yml@ (see sections "2.1":#21-ahrd-example-usages and "3.2":#32-used-formulae-and-parameters for more details).
In order to parallelize the protein function annotation processes, AHRD can be run on batches of recommended size between 1,000 to 2,000 proteins. If you want to annotate very large protein sets or have low memory capacities use the included Batcher to split your input-data into Batches of appropriate size (see section "2.3":#23-batcher). Note: As of Java 7 or higher AHRD is quite fast and batching might no longer be necessary.
h3. 2.1 AHRD example usages
There are two template AHRD input files provided that you should use according to your use case. All example input files are stored in @./test/resources@ and are named @ahrd_example_input*.yml@. You can run AHRD on any of these use cases with
java -Xmx2g -jar ./dist/ahrd.jar your_use_case_file.yml
| Use Case | Template File | | Annotate your Query proteins with Human Readable Descriptions (HRD) | @./test/resources/ahrd_example_input.yml@ | | Annotate your Query proteins with HRD and Gene Ontology (GO) terms | @./test/resources/ahrd_example_input_go_prediction.yml@ |
h3. 2.2 Input
Example files for all input files can be found under @./test/resources/@. NOTE: Only files containing @example@ in their filename should be used as template input files. Other YAML files are used for testing purposes.
h4. 2.2.1 Required input data
(If you run AHRD in batches the blast search results need to be batched in the same way as the fasta files.)
Recommended Sequence Similarity Search:
For your query proteins you should start independent BLAST searches e.g. in the three different databases mentioned above:
blastp -outfmt 6 -query query_sequences_AA.fasta -db uniprot_swissprot.fasta -out query_vs_swissprot.txt
h4. 2.2.2 Optional input data
If you want AHRD to predict your query protein's functions with Gene Ontology (GO) terms, you need to provide the GO annotations of reference proteins. See section "3.3.2":#332-parameters-controlling-gene-ontology-term-annotations for more details.
h4. 2.2.3 Required config files
h5. 2.2.3.1 Test custom blacklists and filters
As explained in 2.2.3 AHRD makes use of blacklists and filters provided as Java regular expressions. You can test your own custom blacklists and filters:
Example Output for test string "activity", and regular expressions "(?i)interacting", and "(?i)activity" applied in serial:
[junit] activity [junit] (?i)interacting -> activity [junit] (?i)activity ->
The above example demonstrates how the first regular expression does not match anything in the test string "activity", but after matching it against the second regular expression nothing remains, because the matched substring has been filtered out. As you can see, this test applies all provided regular expression in order of appearance and shows what remains of the provided test string after filtering with the provided regular expressions.
h3. 2.3 Batcher
AHRD provides a function to generate several input.yml files from large datasets, consisting of batches of query proteins. For each of these batches the user is expected to provide the batch's query proteins in FASTA format, and one Blast result file for each database searched. The AHRD batcher will then generate a unique input.yml file and entry in a batch shell script to execute AHRD on the respective batches in parallel for example on a compute cluster using a batch-system like LSF. We recommend this for very large datasets (more than a genome) or computers with low RAM.
To generate the mentioned input.yml files and batcher shell script that can subsequently be used to start AHRD in parallel use the batcher function as follows:
java -cp ./dist/ahrd.jar ahrd.controller.Batcher ./batcher_input_example.yml
You will have to edit @./batcher_input_example.yml@ and provide the following arguments. Note, that in the mentioned directories each file will be interpreted as belonging to one unique Batch, if and only if they have identical file names.
Batch-Name requirement: All above explained files belonging to the same Batch must have the same name. This name must start with alpha-numeric characters and may finish with digits indicating the Batch's number. File extensions are allowed to be varying.
h3. 2.4 Output
AHRD supports two different formats. The default one is a tab-delimited table. The other is FASTA-Format.
h4. 2.4.1 Tab-Delimited Table
AHRD writes out a CSV table with the following columns:
AHRD's quality-code consists of a three character string, where each character is either '' if the respective criteria is met or '-'* otherwise. Their meaning is explained in the following table:
| Position | Criteria | | 1 | Bit score of the blast result is >50 and e-value is <e-10 | | 2 | Overlap of the blast result is >60% | | 3 | Top token score of assigned HRD is >0.5 |
h4. 2.4.2 Fasta-Format
To set AHRD to write its output in FASTA-Format set the following switch in the input.yml:
output_fasta: true
AHRD will write a valid FASTA-File of your query-proteins where the Header will be composed of the same parts as above, but here separated by whitespaces.
h3. 2.5 AHRD run using BLASTX results
In order to run AHRD on BLASTX results instead of BLASTP results you have to modify the following parameters in the ahrd_example_input.yml:
token_score_bit_score_weight: 0.5 token_score_database_score_weight: 0.3 token_score_overlap_score_weight: 0.2
Since the algorithm is based on protein sequences and the BLASTX searches are based on nucleotide sequence there will be a problem calculating the overlap score of the blast result. To overcome this problem the token_score_overlap_score_weight has to be set to 0.0. Therefore the other two scores have to be raised. These three parameters have to sum up to 1. The resulting parameter configuration could look like this:
token_score_bit_score_weight: 0.6 token_score_database_score_weight: 0.4 token_score_overlap_score_weight: 0.0
h3. 2.6 Parameter Optimization
If you have well annotated reference proteins (at least 1000) from an organism closely related to the proteins you want to annotate, you can optimize AHRD's parameters to reproduce protein function predictions more alike to your references. In order to do so you need to start a simulated annealing approach:
java -cp dist/ahrd.jar ahrd.controller.Trainer trainer_example_input.yml
Note: See @./test/resources/trainer_example_input.yml@ for details.
Basically the input file is almost identical to the standard AHRD run. Only the following additional parameters are available; all numeric values given show the default used for the respective parameter:
h4. 2.6.1 Optimization in parallel (Trainer-Batcher)
Simulated Annealing is a time and resource consuming process. Because AHRD uses quite a number of parameters, the search space is accordingly large. Hence it is recommendable to start independend simulated annealing runs in parallel. Do this using
java -cp ./dist/ahrd.jar ahrd.controller.TrainerBatcher ./test/resources/trainer_batcher_example.yml
The following parameters are specific to the Trainer-Batcher:
Note, that the Trainer-Batcher works very much like the AHRD-Batcher (section "2.3":#23-batcher). Particularly you need to respect the batch-name requirements explained there.
The above Trainer-Batcher example input shows how to automatically generated a desired number of input files with different starting points in parameter space. These input files can then directly be used with the above documented AHRD Trainer (section "2.6":#26-parameter-optimization).
h3. 2.7 Computing F-Scores for selected parameter sets (AHRD-Evaluator)
Having different parameter sets AHRD enables you to compute their performance in terms of F-Scores for each reference protein. Optionally you can also revise the theorectically maximum attainable F-Score and see how well the best Hits from each sequence similarity search perform. In order to do so, use the Evaluator function:
java -cp ./dist/ahrd.jar ahrd.controller.Evaluator ./evaluator_example.yml
See @./test/resources/evaluator_example.yml@ for more details.
Parameters specific to the Evaluator function are:
Optionally you can apply AHRD's balcklisting and filtering on the Reference Descriptions, too. This results in performance only being assessed on non-blacklistes and filtered descriptions, of which only meaningful words, i.e. non blacklisted tokens will be scored. In order to do so, you need to set all of the three following parameters. Note, that you can point any of them to empty dummy files, if you wish.
h2. 3 Algorithm
Based on e-values the 200 best scoring blast results are chosen from each database-search (e.g. Swissprot, TAIR, trEMBL). For all resulting candidate description lines a score is calculated using a lexical approach. First each description line is passed through two regular expression filters. The first filter discards any matching description line in order to ignore descriptions like e.g. 'Whole genome shotgun sequence', while the second filter tailors the description lines deleting matching parts, in order to discard e.g. the trailing Species-Descriptions 'OS=Arabidopsis thaliana [...]". In the second step of the scoring each description line is split into single tokens, which are passed through a blacklist filter, ignoring all matching tokens in terms of score. Tokens are sequences of characters with a collective meaning. For each token a score is calculated from three single scores with different weights, the bit score, the database score and the overlap score. The bit score is provided within the blast result. The database score is a fixed score for each blast database, based on the description quality of the database. The overlap score reflects the overlap of the query and subject sequence. In the second step the sum of all token scores from a description line is divided by a correction factor that avoids the scoring system from being biased towards longer or shorter description lines. From this ranking now the best scoring description line can be chosen. In the last step a domain name provided by InterProScan results, if available, is extracted and appended to the best scoring description line for each uncharacterized protein. In the end for each uncharacterized protein a description line is selected that comes from a high-scoring BLAST match, that contains words occurring frequently in the descriptions of highest scoring BLAST matches and that does not contain meaningless "fill words". Each HRD line will contain an evaluation section that reflects the significance of the assigned human readable description.
h3. 3.1 Pseudo-Code
h3. 3.2 Used Formulae and Parameters
!https://raw.githubusercontent.com/groupschoof/AHRD/master/images/formulae.jpg!
h3. 3.3 Parameters
Above formulae use the following parameters as given in ./ahrd_example_input.yml. These parameters can either be used as provided or can be adapted to your needs.
The weights in the above formulae are
token_score_bit_score_weight: 0.468 token_score_database_score_weight: 0.2098 token_score_overlap_score_weight: 0.3221
and Blast-Database specific, for example for UniprotKB/Swissprot:
weight: 653 description_score_bit_score_weight: 2.717061
h4. 3.3.1 Parameters controlling the parsing of tabular sequence similarity search result tables (legacy BLAST, BLAST+, and BLAT)
AHRD has been designed to work with tabular sequence similarity search results in tabular format. By default it parses the tabular "Blast 8" output:
| Sequence Similarity Search Tool | recommended output format (@command line switch@) | | Blast+ | @-outfmt 6@ | | legacy Blast | @-m 8@ | | Blat | @-out=blast8@ |
The following paramters can be set optionally, if your tabular sequence similarity search result deviates from the above "Blast 8" format. See example file @./test/resources/ahrd_input_seq_sim_table.yml@ for more details.
| Optional Parameter | example | meaning of parameter | | seq_sim_search_table_comment_line_regex | @"#"@ | single character that starts a to be ignored comment line | | seq_sim_search_table_sep | @"\t"@ | single character that separates columns | | seq_sim_search_table_query_col | @10@ | number of column holding the query protein's accession | | seq_sim_search_table_subject_col | @11@ | number of column holding the Hit protein's accession | | seq_sim_search_table_query_start_col | @16@ | number of column holding the query's start amino acid position in the local alignment | | seq_sim_search_table_query_end_col | @17@ | number of column holding the query's end amino acid position in the local alignment | | seq_sim_search_table_subject_start_col | @18@ | number of column holding the Hit's start amino acid position in the local alignment | | seq_sim_search_table_subject_end_col | @19@ | number of column holding the Hit's end amino acid position in the local alignment | | seq_sim_search_table_e_value_col | @20@ | number of column holding the Hit's E-Value | | seq_sim_search_table_bit_score_col | @21@ | number of column holding the Hit's Bit-Score |
NOTE: All above column numbers start counting with zero, i.e. the first column has number 0.
h4. 3.3.2 Parameters controlling Gene Ontology term annotations
AHRD is capable of annotating the Query proteins with Gene Ontology (GO) terms. It does so, by transferring the reference GO terms found in the Blast Hit AHRD selects as source of the resulting HRD. To be able to pass these reference GO terms AHRD needs a reference GO annotation file (GOA). By default AHRD expects this GOA file to be in the standard Uniprot format. You can download the latest GOA file from the "Uniprot server":http://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/. To obtain GO annotations for all UniprotKB proteins download file @goa_uniprot_all.gaf.gz@ (last visit Feb 16th 2017)
To have AHRD annotate your proteins with GO terms, you just need to provide the optional parameter @gene_ontology_result@. See example file @./test/resources/ahrd_input_seq_sim_table_go_prediction.yml@ for more details.
h5. 3.3.2.0 Prefer reference proteins as candidates that have GO Term annotations
The parameter @prefer_reference_with_go_annos: true@ is highly recommended when aiming to annotate your query proteins with GO Terms. If this parameter is set to true only those candidate references are considered that also have GO Term annotations. However, if you put more emphasis on Human Readable Descriptions and are prepared to accept a couple of your queries to not get any GO Term predictions you can switch this off with @prefer_reference_with_go_annos: false@ or just omit the parameter as by default it is set to @false@.
h5. 3.3.2.1 Custom reference Gene Ontology annotations (non UniprotKB GOA)
Unfortunately UniprotKB GOA files use short protein accessions like @W9QFR0@, while the UniprotKB FASTA databases use the long protein accessions with pipes like this @tr|W9QFR0|W9QFR0_9ROSA@. In order to enable AHRD to match the correct short and long accessions, and thus the reference GO annotations it uses regular expressions to parse both the long accessions and the GOA files. By default AHRD is setup to handle Uniprot formats, but you can provide custom regular expressions for your own custom GOA files:
Note: You must provide the above named match groups @shortAccession@ and @goTerm@, respectively.
h2. 4 Testing
If you want to run the complete JUnit Test-Suite execute:
ant
If you want to execute a test run on example proteins use:
ant test.run
h2. 5 License
See attached file LICENSE.txt for details.
h2. 6 Authors
Dr. Asis Hallab, Kathrin Klee, Florian Boecker, Dr. Girish Srinivas, and Prof. Dr. Heiko Schoof
INRES Crop Bioinformatics University of Bonn Katzenburgweg 2 53115 Bonn Germany
h2. 7 References
High quality genome projects that used AHRD:
fn1. Young, Nevin D., Frédéric Debellé, Giles E. D. Oldroyd, Rene Geurts, Steven B. Cannon, Michael K. Udvardi, Vagner A. Benedito, et al. “The Medicago Genome Provides Insight into the Evolution of Rhizobial Symbioses.” Nature 480, no. 7378 (December 22, 2011): 520–24. doi:10.1038/nature10625.
fn2. The Tomato Genome Consortium. “The Tomato Genome Sequence Provides Insights into Fleshy Fruit Evolution.” Nature 485, no. 7400 (May 31, 2012): 635–41. doi:10.1038/nature11119.
fn3. International Wheat Genome Sequencing Consortium (IWGSC). “A Chromosome-Based Draft Sequence of the Hexaploid Bread Wheat (Triticum Aestivum) Genome.” Science (New York, N.Y.) 345, no. 6194 (July 18, 2014): 1251788. doi:10.1126/science.1251788.
fn4. International Barley Genome Sequencing Consortium, Klaus F. X. Mayer, Robbie Waugh, John W. S. Brown, Alan Schulman, Peter Langridge, Matthias Platzer, et al. “A Physical, Genetic and Functional Sequence Assembly of the Barley Genome.” Nature 491, no. 7426 (November 29, 2012): 711–16. doi:10.1038/nature11543.
fn5. Wang, W., G. Haberer, H. Gundlach, C. Gläßer, T. Nussbaumer, M. C. Luo, A. Lomsadze, et al. “The Spirodela Polyrhiza Genome Reveals Insights into Its Neotenous Reduction Fast Growth and Aquatic Lifestyle.” Nature Communications 5 (2014): 3311. doi:10.1038/ncomms4311.
fn6. Guo, Shaogui, Jianguo Zhang, Honghe Sun, Jerome Salse, William J. Lucas, Haiying Zhang, Yi Zheng, et al. “The Draft Genome of Watermelon (Citrullus Lanatus) and Resequencing of 20 Diverse Accessions.” Nature Genetics 45, no. 1 (January 2013): 51–58. doi:10.1038/ng.2470.
AHRD was applied on all plant genomes present in the public database PlantsDB:
fn7. Spannagl, Manuel, Thomas Nussbaumer, Kai C. Bader, Mihaela M. Martis, Michael Seidel, Karl G. Kugler, Heidrun Gundlach, and Klaus F. X. Mayer. “PGSB PlantsDB: Updates to the Database Framework for Comparative Plant Genome Research.” Nucleic Acids Research 44, no. D1 (January 4, 2016): D1141–47. doi:10.1093/nar/gkv1130.