Closed asishallab closed 2 years ago
Use the R package stored here:
/mnt/data/asis/prot-scriber/prot.scriber
The general approach for all reference word sets should be to allow exclusion of certain words by regular expressions, e.g. the words "domain" and "protein" do not convey information about a protein's function. Use the regular expression blacklist blacklist.word.regexs (from file ./inst/blacklist_token.txt).
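A minimal sketch of such a blacklist filter, in base R. The helper name filterBlacklistedWords and the two example patterns are assumptions made for illustration; the actual patterns are whatever ./inst/blacklist_token.txt contains (the sketch also assumes that file holds one regular expression per line).

```r
# Hypothetical helper (not part of the package as far as this issue states):
# drops every word that matches at least one blacklist regular expression.
filterBlacklistedWords <- function(words, blacklist.regexs) {
  if (length(blacklist.regexs) == 0) return(words)
  is.blacklisted <- vapply(words, function(w) {
    any(vapply(blacklist.regexs, function(rx) grepl(rx, w, perl = TRUE),
               logical(1)))
  }, logical(1), USE.NAMES = FALSE)
  words[!is.blacklisted]
}

# Illustrative patterns only. In the package they would presumably be read
# with something like readLines("./inst/blacklist_token.txt").
example.blacklist <- c("^domain$", "^protein$")

kept.words <- filterBlacklistedWords(
  c("phosphatase", "domain", "protein", "kinase"), example.blacklist)
print(kept.words)
```

Anchoring the patterns with ^ and $ keeps words such as "lipoprotein" from being removed by accident; whether that is desired depends on how the blacklist file is written.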
Both Mercator4 and HMMER3 on PfamA can assign multiple annotations to a query protein. Note that in Mercator4 this is an extremely rare event.
The first issue is to parse the Mercator4 result. Consider using (copying) readMercatorResultTable from here:
m.dt <- fread(path.2.mercator.result.tbl, sep = "\t", header = TRUE, stringsAsFactors = FALSE,
na.strings = "", quote = "")
Implement post-processing of column TYPE to become a boolean:
m.dt$TYPE <- # ...
An optional argument should force filtering so that only rows in which the last column is TRUE are retained.
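The post-processing and the optional filter could look roughly like the following. This is a sketch under assumptions: it supposes that TYPE contains the string "T" for rows that should become TRUE and is empty (NA after parsing with na.strings = "") otherwise, and the helper name typeToLogical and its filter.true argument are hypothetical. A plain data.frame with made-up rows stands in for the parsed table to keep the example dependency-free.

```r
# Toy stand-in for the parsed Mercator4 table; all values are invented.
m.df <- data.frame(
  IDENTIFIER = c("prot_a", "prot_b", "prot_c"),
  TYPE = c("T", NA, "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper. ASSUMPTION: "T" in TYPE means TRUE, anything else
# (including NA) means FALSE.
typeToLogical <- function(tbl, filter.true = FALSE) {
  tbl$TYPE <- !is.na(tbl$TYPE) & tbl$TYPE == "T"
  # Optionally keep only the rows whose TYPE is TRUE.
  if (filter.true) tbl <- tbl[tbl$TYPE, , drop = FALSE]
  tbl
}

all.rows <- typeToLogical(m.df)
true.rows <- typeToLogical(m.df, filter.true = TRUE)
print(true.rows)
```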
Then iterate over the unique (distinct) query proteins (each can have more than one matching row) and extract all NAME and DESCRIPTION annotations. Split those into words using any white-space character and . as separators, lower-case the words, filter them with the regexs, sort them, and retain them as reference words in an R list:
pc.ref.words.mercator4 <- list(`query.gene.id` = c('alien', 'oxidator', 'phosphatase'))
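The steps above can be sketched in base R as follows. Several things here are assumptions and not taken from the package: the helper name extractReferenceWords, the use of a column named IDENTIFIER for the query protein, the split regex "(\\s|\\.)+", the removal of duplicate words (chosen because the result is called a word set), and the toy table itself.

```r
# Toy table with invented rows; prot_a has two matching rows.
toy.tbl <- data.frame(
  IDENTIFIER = c("prot_a", "prot_a", "prot_b"),
  NAME = c("Protein modification.phosphatase", "alien oxidator", "Kinase domain"),
  DESCRIPTION = c("phosphatase protein", NA, "kinase"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one sorted word set per distinct query protein.
extractReferenceWords <- function(tbl, blacklist.regexs = character(0),
                                  split.regex = "(\\s|\\.)+") {
  ids <- unique(tbl$IDENTIFIER)
  ref.words <- lapply(ids, function(id) {
    rows <- tbl[tbl$IDENTIFIER == id, , drop = FALSE]
    txt <- c(rows$NAME, rows$DESCRIPTION)
    txt <- txt[!is.na(txt)]
    # Split on white space and dots, then lower-case.
    words <- tolower(unlist(strsplit(txt, split.regex, perl = TRUE)))
    words <- words[nzchar(words)]
    # Remove every word matching any blacklist regex.
    for (rx in blacklist.regexs) {
      words <- words[!grepl(rx, words, perl = TRUE)]
    }
    sort(unique(words))
  })
  names(ref.words) <- ids
  ref.words
}

toy.ref.words <- extractReferenceWords(toy.tbl, c("^domain$", "^protein$"))
str(toy.ref.words)
```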
Utility functions should go into any file in ./R/, e.g. funks.R. Write a simple executable Rscript (example: ./exec/loadPcoccineusSeqSimSearchResults.R) that parses the Mercator4 results of P. coccineus and stores the above reference words in an RData file.
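The storing step of such a script might look like this. It is only a sketch: the helper name saveReferenceWords is hypothetical, the word list is a placeholder, and a temporary file stands in for the real output path, which this issue does not specify.

```r
# Hypothetical helper: stores the reference words under the object name
# pc.ref.words.mercator4 in an RData file.
saveReferenceWords <- function(ref.words, out.file) {
  pc.ref.words.mercator4 <- ref.words
  save(pc.ref.words.mercator4, file = out.file)
  invisible(out.file)
}

# Placeholder content and a temporary output path for illustration.
example.words <- list(`query.gene.id` = c("alien", "oxidator", "phosphatase"))
tmp.file <- tempfile(fileext = ".RData")
saveReferenceWords(example.words, tmp.file)

# Loading into a fresh environment shows what a user of the file would get.
loaded.env <- new.env()
loaded.names <- load(tmp.file, envir = loaded.env)
print(loaded.names)
```

In an executable Rscript the input and output paths would typically come from commandArgs(trailingOnly = TRUE).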
Ignore all annotations that are in root BIN 35, because they are just best Blast descriptions.
In annotations that are in root BIN 50, throw away the best Blast hit marked with a leading &, e.g.
Enzyme classification.EC_1 oxidoreductases.EC_1.1 oxidoreductase acting on CH-OH group of donor(50.1.1 : 1064.8) & Probable mannitol dehydrogenase OS=Fragaria ananassa (sp|q9zrf1|mtdh_fraan : 478.0)
Delete the & and everything after it until the end of the string. The rest should go into the reference word sets.
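The two BIN rules could be sketched as below. Assumptions: the column names BINCODE and DESCRIPTION, the helper name cleanBinAnnotations, and that bin codes may still carry single quotes because the table is read with quote = "". The BIN 50 row reuses the example string from this issue; the other rows are invented.

```r
# Toy rows: one in root BIN 35, one in root BIN 50, one elsewhere.
toy.bins <- data.frame(
  BINCODE = c("'35.1'", "'50.1.1'", "'1.2.3'"),
  DESCRIPTION = c(
    "some best Blast description",
    paste0("Enzyme classification.EC_1 oxidoreductases.EC_1.1 oxidoreductase ",
           "acting on CH-OH group of donor(50.1.1 : 1064.8) & Probable mannitol ",
           "dehydrogenase OS=Fragaria ananassa (sp|q9zrf1|mtdh_fraan : 478.0)"),
    "an unaffected description"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing both rules.
cleanBinAnnotations <- function(tbl) {
  # Root BIN = everything before the first dot, quotes removed.
  root.bin <- sub("\\..*$", "", gsub("'", "", tbl$BINCODE))
  # Rule 1: drop all rows in root BIN 35.
  keep <- is.na(root.bin) | root.bin != "35"
  tbl <- tbl[keep, , drop = FALSE]
  root.bin <- root.bin[keep]
  # Rule 2: in root BIN 50, cut the description at the first "&".
  in.50 <- !is.na(root.bin) & root.bin == "50"
  tbl$DESCRIPTION[in.50] <- sub("\\s*&.*$", "", tbl$DESCRIPTION[in.50], perl = TRUE)
  tbl
}

cleaned.bins <- cleanBinAnnotations(toy.bins)
print(cleaned.bins$DESCRIPTION)
```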
Consider sifting through a few example rows, not using the default split-regex in function wordSet but just \\s+ instead.
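To illustrate the difference on the example annotation from this issue, the snippet below splits the same string once on white space and dots and once on white space only. The regex "(\\s|\\.)+" is an assumption based on the separators described in this issue; the actual default in wordSet may differ.

```r
example.row <- paste(
  "Enzyme classification.EC_1 oxidoreductases.EC_1.1",
  "oxidoreductase acting on CH-OH group of donor"
)

# Assumed default: white space and dots both act as separators.
split.ws.dot <- strsplit(tolower(example.row), "(\\s|\\.)+", perl = TRUE)[[1]]
# Alternative suggested here: white space only.
split.ws <- strsplit(tolower(example.row), "\\s+", perl = TRUE)[[1]]

print(split.ws.dot)
print(split.ws)
```

With white space only, tokens such as "classification.ec_1" stay intact, whereas splitting on dots also breaks the EC number into fragments like "ec_1" and "1".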
Using annotations obtained from Mercator (MapMan Bin Ontology) and Pfam-A annotations as a gold standard
Implement functions that extract human-readable reference descriptions from predictions obtained from Mercator4 (MapMan ontology) and HMMER3 against PfamA. Split these reference annotations into word sets. Make them available as RData in our research R package.