usadellab / prot-scriber

Assigns short human readable descriptions to biological sequences or gene families using references. For this, prot-scriber consumes sequence similarity search results in tabular format (Blast or Diamond).
GNU General Public License v3.0

Performance evaluation #37

Closed asishallab closed 3 weeks ago

asishallab commented 1 year ago

General Information

Data and code directory on the server: /mnt/data/asis/prot-scriber

Note that all relative paths in the following are rooted in this directory.

R code for the evaluation: prot.scriber-evaluation_R

The executable exec/measurePerformance.R can be run with Rscript exec/measurePerformance.R

Rust code of the production version of prot-scriber: prot-scriber-Rust

From inside that directory, the executable can be run with ./target/release/prot-scriber --help

Note: You can link (ln -s) to the above executable in your $PATH ...
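As a minimal sketch of the linking step in the note above: the helper name `link_into_path` and the target directory `$HOME/bin` are assumptions introduced here, and the directory must itself be listed in `$PATH`. The sketch also assumes that the executable lives under `prot-scriber-Rust/target/release/` in the project directory.

```shell
#!/bin/bash
# Hypothetical helper: symlink an executable into a directory on $PATH.
# Pass an ABSOLUTE path to the executable, otherwise the link will dangle.
link_into_path() {
  local exe="$1" bin_dir="$2"
  # Refuse to link something that is missing or not executable.
  [ -x "$exe" ] || { echo "not executable: $exe" >&2; return 1; }
  mkdir -p "$bin_dir"
  # -s: symbolic link, -f: replace an existing link of the same name.
  ln -sf "$exe" "$bin_dir/$(basename "$exe")"
}

# Example call (paths are assumptions, adjust to your setup):
# link_into_path /mnt/data/asis/prot-scriber/prot-scriber-Rust/target/release/prot-scriber "$HOME/bin"
```

After linking, `prot-scriber --help` should work from any directory, provided the chosen directory is on your `$PATH`.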

General approach

The following evaluation procedure is implemented in the R-script mentioned below. The script

Install the prot-scriber R version

Change to the project directory and open R

cd /mnt/data/asis/prot-scriber/prot.scriber-evaluation_R
R

In an interactive R-shell execute:

install.packages(c('data.table', 'optparse', 'brew', 'seqinr', 'ggplot2', 'RColorBrewer'))
q()

Finally, in the Bash shell, execute:

R CMD INSTALL .
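The installation steps above can also be run without opening an interactive R session. This is only a sketch: the helper names `run` and `install_eval_package`, the `DRY_RUN` switch, and the CRAN mirror URL are assumptions introduced here. A mirror is set explicitly because `install.packages` needs one when it is called non-interactively.

```shell
#!/bin/bash
# Hypothetical helper: print the command if DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Install the R dependencies listed above, then the evaluation package itself.
# $1: path to the package directory (prot.scriber-evaluation_R).
install_eval_package() {
  local pkg_dir="$1"
  # ASSUMPTION: the mirror URL is our choice, not taken from the original post.
  run Rscript -e "install.packages(c('data.table', 'optparse', 'brew', 'seqinr', 'ggplot2', 'RColorBrewer'), repos = 'https://cloud.r-project.org')"
  run R CMD INSTALL "$pkg_dir"
}

# Example call:
# install_eval_package /mnt/data/asis/prot-scriber/prot.scriber-evaluation_R
```

Setting `DRY_RUN=1` before the call prints the two commands instead of running them, which is a cheap way to check them first.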

Gold standard data

This is the data we will run prot-scriber on and evaluate its annotations against.

Directory of evaluation data: /mnt/data/asis/prot-scriber/evaluation

We have three data sets that were not yet in UniProt when the evaluation started:

Reference annotations

We compare the words in prot-scriber annotations with the words in reference annotations. Here, "annotations" means protein function predictions in the form of short human readable descriptions (HRDs). prot-scriber generates the HRDs under evaluation. The reference annotations are Pfam-A annotations, generated by running HMMER3 on each of the above protein sets, and MapMan4 [2] annotations, generated with Mercator [1].

For each of the three protein sets above, you will find the respective annotation files.

For P. coccineus:

For Faba:

For MetaEuk: Note that, for the performance measurements, MetaEuk has been processed in batches (subsets). We used eight batches.

prot-scriber input data

prot-scriber consumes the output of BLAST (or Diamond, a modern and very fast BLAST alternative) to generate its protein function predictions in the form of short human readable descriptions (HRDs).

The BLAST output tables that prot-scriber consumes here have been generated using UniProtKB databases from April 2021.

If you run BLAST (or Diamond) again at any point, you must use the BLAST databases in the following folder: /mnt/data/asis/UniProt/previous/20210408, because those do not yet contain the above reference proteins.

Blast results for the respective reference proteins

For P. coccineus:

For Faba:

For MetaEuk batches, e.g. batch_1 in the folder MetaEuk_batches:

The job management system

Read the manual provided by our system administrators!

Most important commands:

For an example job script that runs the evaluation R-script on the prot-scriber annotations generated for MetaEuk batch_1, see: ./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh

Copy such a script and adjust it to your needs. Consider the header:

#!/bin/bash
#$ -l mem_free=4G,h_vmem=4G
#$ -pe smp 20
#$ -e /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.err
#$ -o /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.out
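In this header, `-l` requests resources (here memory), `-pe smp 20` requests 20 slots in the `smp` parallel environment, and `-e` / `-o` name the files that receive the job's standard error and standard output. Since the copies for the other batches will mostly differ in the batch number, the copy-and-adjust step can be sketched as below. The helper name `make_batch_scripts` is hypothetical, and the sketch assumes that replacing every `batch_1` with `batch_N` is the only adjustment needed; check each generated script before submitting it.

```shell
#!/bin/bash
# Hypothetical helper: derive job scripts for batches 2..N from the batch_1
# template by replacing "batch_1" in the file name and in the file contents.
# $1: path to the batch_1 template script, $2: total number of batches.
make_batch_scripts() {
  local template="$1" n_batches="$2" dir base target i
  dir="$(dirname "$template")"
  base="$(basename "$template")"
  # Guard: without "batch_1" in the name we would overwrite the template.
  case "$base" in
    *batch_1*) ;;
    *) echo "template name must contain batch_1: $base" >&2; return 1 ;;
  esac
  i=2
  while [ "$i" -le "$n_batches" ]; do
    target="$dir/$(printf '%s' "$base" | sed "s/batch_1/batch_$i/")"
    # Rewrites the -e/-o log paths in the header as well as the script body.
    sed "s/batch_1/batch_$i/g" "$template" > "$target"
    i=$((i + 1))
  done
}

# Example call for the eight MetaEuk batches:
# make_batch_scripts ./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh 8
```

Each resulting script can then be submitted with `qsub <script>`; consult the administrators' manual for the queue settings that apply on this server.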

References

  1. Lohse, M., Nagel, A., Herter, T., May, P., Schroda, M., Zrenner, R., Tohge, T., Fernie, A. R., Stitt, M., & Usadel, B. (2014). Mercator: A fast and simple web server for genome scale functional annotation of plant sequence data. Plant, Cell & Environment, 37(5), 1250–1258. https://doi.org/10.1111/pce.12231
  2. Schwacke, R., Ponce-Soto, G. Y., Krause, K., Bolger, A. M., Arsova, B., Hallab, A., Gruden, K., Stitt, M., Bolger, M. E., & Usadel, B. (2019). MapMan4: A refined protein classification and annotation framework applicable to multi-omics data analysis. Molecular Plant. https://doi.org/10.1016/j.molp.2019.01.003

asishallab commented 1 year ago

Rerun prot-scriber and re-evaluate

prot-scriber has a new version and thus its annotations have to be re-done and evaluated again.

Do this for the three reference proteomes. Start with Faba and P. coccineus.

Steps

  1. Run prot-scriber again
  2. Run evaluation R-script again

Put all scripts in the /mnt/data/asis/prot-scriber/evaluation/scripts directory, except for those of the MetaEuk analyses, which should go into the /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts directory.

Tips

Some BLAST runs were executed differently, so the order of columns in the BLAST result tables differs between the three protein sets.

To see which columns appear in which order, inspect the script that ran BLAST, e.g. Pcoccineus_vs_swissprot_oge_job.sh for BLAST on P. coccineus searching for hits in the SwissProt DB.
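One quick way to inspect a job script for its column order is to print the line that sets the output format. This sketch assumes that the job scripts pass the column list through BLAST+'s `-outfmt` option (or Diamond's `--outfmt`); the helper name `show_outfmt` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print, with line numbers, every line of a job script
# that sets the tabular output format. The pattern "-outfmt" matches both
# BLAST+ "-outfmt" and Diamond "--outfmt".
# Returns non-zero if the script contains no such option.
show_outfmt() {
  grep -n -e '-outfmt' "$1"
}

# Example call (the location of the job script is not stated above):
# show_outfmt path/to/Pcoccineus_vs_swissprot_oge_job.sh
```

If the command in the job script is split over several lines, the column names may follow on the line after the match, so look at the surrounding lines too.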

You probably must use the