Closed asishallab closed 3 weeks ago
prot-scriber has a new version and thus its annotations have to be re-done and evaluated again.
Do this for the three reference proteomes. Start with Faba and P. coccineus.
Put all scripts in the /mnt/data/asis/prot-scriber/evaluation/scripts directory, except for MetaEuk analyses, which should go into the /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts directory.
Some BLAST runs were executed with different settings, so the order of columns in the BLAST result tables differs between the three protein sets.
To see which columns appear in which order, inspect the respective BLAST job script, e.g. Pcoccineus_vs_swissprot_oge_job.sh for BLAST on P. coccineus searching for hits in the SwissProt DB.
You will probably need the following prot-scriber arguments:
-e, --header <header>
-p, --field-separator <field-separator>
-n number of threads; five to ten are recommended.
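Looking up the column order in a job script can be sketched as follows. This is a minimal sketch, assuming the job script calls BLAST+ with the quoted tabular form `-outfmt "6 col1 col2 ..."`; the function name `blast_columns` is hypothetical, and a Diamond call (which writes its output format differently) would need an adjusted pattern.

```shell
# Print the column names that a BLAST+ job script passes via -outfmt "6 ...".
# Assumption: the script uses the quoted form, e.g.
#   blastp ... -outfmt "6 qseqid sseqid evalue bitscore" ...
# $1: path to the job script (hypothetical helper, not part of prot-scriber)
blast_columns() {
  # Take the first quoted -outfmt argument, then strip the option name,
  # the closing quote and the leading numeric format code.
  grep -o -- '-outfmt "[^"]*"' "$1" \
    | head -n 1 \
    | sed -e 's/^-outfmt "//' -e 's/"$//' -e 's/^[0-9][0-9]* *//'
}
```

For example, `blast_columns Pcoccineus_vs_swissprot_oge_job.sh` prints the column names in the order they appear in the result table. How that list has to be written for `--header` is described in `prot-scriber --help`.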
General Information
Data and code directory on the server:
/mnt/data/asis/prot-scriber
Note that in the following all relative paths are to be rooted in this directory.
R-Code for evaluation:
prot.scriber-evaluation_R
Executable exec/measurePerformance.R can be executed with Rscript exec/measurePerformance.R
Rust-Code of the production version of prot-scriber:
prot-scriber-Rust
can be executed with /target/release/prot-scriber --help
Note: You can link (ln -s) to the above executable in your $PATH ...
General approach
The following evaluation procedure is implemented in the R-script mentioned below. The script
Install the prot-scriber R version
Change to the project directory and open R
In an interactive R-shell execute:
Finally in the BASH-shell execute
Gold standard data
This is the data on which we will run prot-scriber and against which we will evaluate its annotations.
Directory of evaluation data:
/mnt/data/asis/prot-scriber/evaluation
We have three data-sets that at the time of starting the evaluation were not in UniProt yet:
Reference annotations
We compare the words in prot-scriber annotations with the words in reference annotations. Note that "annotations" here means protein function predictions: the short human readable descriptions (HRDs) generated by prot-scriber, the Pfam-A annotations generated by running HMMER3 on each of the above protein sets, and the MapMan4 [2] annotations generated with Mercator [1].
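The idea of comparing words can be illustrated with a small sketch. It only shows one simple way to find the words shared by a prediction and a reference; the function names `words` and `shared_words` are hypothetical, and the actual measure computed by exec/measurePerformance.R may differ.

```shell
# Split a description into unique lower-case words, one per line.
words() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs '[:alnum:]' '\n' \
    | sed '/^$/d' \
    | sort -u
}

# Print the words shared by a predicted HRD ($1) and a reference annotation ($2).
shared_words() {
  ref_words=$(mktemp)
  words "$2" > "$ref_words"
  # Keep only those predicted words that match a reference word exactly.
  words "$1" | grep -Fx -f "$ref_words"
  rm -f "$ref_words"
}
```

Calling `shared_words "Putative kinase protein" "protein kinase domain"` prints `kinase` and `protein`, one per line.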
For each of the three above protein sets you find the respective annotation files.
For P. coccineus:
Pcoccineus_mercator_v4_results.txt: the Mercator annotations
Pcoccineus_vs_PfamA_hmmscan_out.tsv: the Pfam annotations
For Faba:
Faba_mercator_results.txt: the Mercator annotations
Faba_vs_PfamA_hmmscan_out.tsv: the Pfam annotations
MetaEuk: Note that MetaEuk has been processed in batches (sub-sets) for the performance measures. We used eight batches.
MetaEuk_batches/Mercator_MapMan4_annotations: the Mercator (MapMan4) reference annotations
MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_PfamA_batch_1.txt (replace batch_1 with your batch number): the Pfam-A annotations
prot-scriber input data
prot-scriber consumes BLAST (or Diamond, a modern and very fast BLAST reimplementation) outputs to generate its protein function predictions in the form of short human readable descriptions (HRDs).
The BLAST output tables that prot-scriber consumes have been generated using UniProtKB databases from April 2021.
If you run BLAST (Diamond) again at any point, you must use the BLAST databases in the following folder:
/mnt/data/asis/UniProt/previous/20210408
because those do not yet contain the above reference proteins.
Blast results for the respective reference proteins
For P. coccineus:
Pcoccineus_vs_swisprot_blastp.txt
Pcoccineus_vs_trembl_blastp.txt
For Faba:
Faba_vs_swisprot_blastp.txt
Faba_vs_trembl_blastp.txt
For MetaEuk batches, e.g. batch_1, in the folder MetaEuk_batches:
MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_Swissprot_batch_1.txt
MetaEuk_preds_Tara_vs_euk_profiles_uniqs_vs_trembl_blastp_batch_1.txt
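Since the two file names above only differ in the batch number, the table pairs for all batches can be listed with a short loop. This is a sketch under the assumption that the batches are numbered 1 to 8 and follow the batch_1 naming shown above; the function name `batch_blast_tables` is hypothetical.

```shell
# Print one tab-separated line per MetaEuk batch:
#   <SwissProt BLAST table> <TAB> <TrEMBL BLAST table>
# $1: directory holding the tables, e.g. evaluation/MetaEuk_batches
# $2: number of batches (defaults to 8; numbering 1..N is an assumption)
batch_blast_tables() {
  for i in $(seq 1 "${2:-8}"); do
    printf '%s\t%s\n' \
      "$1/MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_Swissprot_batch_${i}.txt" \
      "$1/MetaEuk_preds_Tara_vs_euk_profiles_uniqs_vs_trembl_blastp_batch_${i}.txt"
  done
}
```

Running `batch_blast_tables evaluation/MetaEuk_batches` from the project directory yields eight lines that can be fed into a loop that calls prot-scriber once per batch.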
The job management system
Read the manual provided by our system administrators!
Most important commands:
qsub to submit a script to the job system
qstat to see the status of your running jobs. Use qstat -a ("all") to see terminated jobs, too.
qhost to see available hosts (nodes), i.e. compute servers
To run a script that e.g. executes the evaluation R-script on prot-scriber annotations generated for the MetaEuk batch_1 see:
./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh
Copy such a script and adjust it to your needs. Consider the header:
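Copying the batch_1 script for another batch can be sketched as follows. This is only a sketch, assuming that the batch number appears as the literal string batch_1 both in the script name and inside the script; the function name `make_batch_script` is hypothetical, and the job header still has to be checked by hand.

```shell
# Write a copy of a batch_1 job script adjusted for another batch.
# $1: path to the batch_1 template script
# $2: target batch number (2..8; with at most eight batches no batch_1x exists)
# Prints the path of the newly written script.
make_batch_script() {
  target=$(printf '%s' "$1" | sed "s/batch_1/batch_$2/")
  # Refuse to overwrite the template itself (e.g. when $2 is 1).
  if [ "$target" = "$1" ]; then
    echo "target equals template: $1" >&2
    return 1
  fi
  sed "s/batch_1/batch_$2/g" "$1" > "$target"
  printf '%s\n' "$target"
}
```

The new script can then be submitted with qsub, for example `qsub "$(make_batch_script ./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh 3)"`.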
References