Generate reference sets

asishallab commented 3 years ago

Implement functions that extract human-readable reference descriptions from predictions obtained from Mercator4 (MapMan ontology) and from HMMER3 searches against PfamA. Split these reference annotations into word sets. Make them available as RData in our research R package.

asishallab commented 3 years ago

Use the R package stored here: /mnt/data/asis/prot-scriber/prot.scriber

The general approach for all reference word sets should be to allow exclusion of certain words by regular expressions, e.g. the words "domain" and "protein" do not convey information about a protein's function. Use the regular expression blacklist blacklist.word.regexs (from file ./inst/blacklist_token.txt).

Both Mercator4 and HMMER3 on PfamA can assign multiple annotations to a query protein. Note that with Mercator4 this is an extremely rare event.
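A minimal sketch of such a blacklist filter, in case it helps. The function name filterBlacklistedWords and the two example patterns are made up for illustration; the real patterns are the ones in ./inst/blacklist_token.txt, which I assume holds one regular expression per line and could be read with readLines().

```r
# Hypothetical helper (name is an assumption, not existing package code):
# drop every word that matches at least one blacklist regular expression.
filterBlacklistedWords <- function(words, blacklist.word.regexs) {
  if (length(words) == 0) return(words)
  # One logical vector per regex, combined with a logical OR.
  is.blacklisted <- Reduce(`|`, lapply(blacklist.word.regexs, function(rgx) {
    grepl(rgx, words, perl = TRUE)
  }), rep(FALSE, length(words)))
  words[!is.blacklisted]
}

# Illustrative patterns only, not the content of ./inst/blacklist_token.txt.
example.blacklist <- c("^domain$", "^proteins?$")
example.words <- c("protein", "phosphatase", "domain", "kinase")
filtered.words <- filterBlacklistedWords(example.words, example.blacklist)
```

Anchoring the patterns with ^ and $ keeps informative words such as "lipoprotein" from being removed by accident.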

The first task is to parse the Mercator4 result. Consider using (copying) readMercatorResultTable from here:

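library(data.table)
# Read the tab-separated Mercator4 result table; empty fields become NA, quotes are kept as-is.
m.dt <- fread(path.2.mercator.result.tbl, sep = "\t", header = TRUE, stringsAsFactors = FALSE,
        na.strings = "", quote = "")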

Implement post-processing of column TYPE to become a boolean:

m.dt$TYPE <- # ...

An optional argument should restrict the result to rows in which this last column (TYPE) is TRUE.
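One possible shape for the TYPE post-processing and the optional filter, as a sketch only. It assumes that TYPE holds "T" or "F", possibly wrapped in single quotes because the table is read with quote = ""; please check that against a real Mercator4 result. The names mercatorTypeToLogical, filterMercatorByType, only.type.true and the IDENTIFIER column are assumptions, and a plain data.frame stands in for the parsed table.

```r
# Hypothetical helper: convert the raw TYPE column into a logical vector.
# Assumption: "T" marks TRUE, optionally wrapped in single quotes.
mercatorTypeToLogical <- function(type.col) {
  # Strip surrounding white space and single quotes, then compare with "T".
  cleaned <- gsub("^\\s*'?|'?\\s*$", "", type.col, perl = TRUE)
  !is.na(cleaned) & cleaned == "T"
}

# Toy stand-in for the parsed Mercator4 table.
m.df <- data.frame(
  IDENTIFIER = c("prot_a", "prot_b", "prot_c"),
  TYPE = c("'T'", "'F'", NA),
  stringsAsFactors = FALSE
)
m.df$TYPE <- mercatorTypeToLogical(m.df$TYPE)

# Hypothetical helper: the optional argument keeps only rows with TYPE == TRUE.
filterMercatorByType <- function(m.tbl, only.type.true = FALSE) {
  if (only.type.true) m.tbl[m.tbl$TYPE, , drop = FALSE] else m.tbl
}
m.filtered <- filterMercatorByType(m.df, only.type.true = TRUE)
```

Missing TYPE values are treated as FALSE here, which is a design choice and not something the Mercator4 format prescribes.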

Then iterate over the unique (distinct) query proteins (can have more than one matching row) and extract all NAME and DESCRIPTION annotations. Split those into words using any white-space character and . as separators, lower-case the words, filter them with the regexs, sort them and retain them as reference words in an R list:

pc.ref.words.mercator4 <- list( `query.gene.id`=c( 'alien', 'oxidator', 'phosphatase' ) )
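The iteration described above could look roughly like this. It is a sketch under assumptions: the query column is assumed to be called IDENTIFIER, the helper name referenceWords and the toy rows are invented, duplicates are dropped because the result is meant to be a word set, and the blacklist patterns are illustrative only.

```r
# Hypothetical helper: split on white space and ".", lower-case, filter with
# the blacklist regexs, de-duplicate and sort.
referenceWords <- function(descriptions, blacklist.word.regexs = character(0)) {
  words <- tolower(unlist(strsplit(descriptions, "[\\s.]+", perl = TRUE)))
  # Remove NAs and the empty strings produced by leading separators.
  words <- words[!is.na(words) & nzchar(words)]
  for (rgx in blacklist.word.regexs) {
    words <- words[!grepl(rgx, words, perl = TRUE)]
  }
  sort(unique(words))
}

# Toy stand-in for the parsed Mercator4 table (invented rows).
m.df <- data.frame(
  IDENTIFIER = c("prot_a", "prot_a", "prot_b"),
  NAME = c("Protein modification.phosphorylation", "Enzyme classification",
           "Solute transport"),
  DESCRIPTION = c("protein phosphatase", "phosphatase domain", "sugar transporter"),
  stringsAsFactors = FALSE
)

# One sorted word set per distinct query protein, named by the query ID.
query.ids <- unique(m.df$IDENTIFIER)
ref.words <- setNames(lapply(query.ids, function(q.id) {
  rows <- m.df[m.df$IDENTIFIER == q.id, ]
  referenceWords(c(rows$NAME, rows$DESCRIPTION), c("^domain$", "^proteins?$"))
}), query.ids)
```

The result is a named list with the same layout as pc.ref.words.mercator4, so it can be stored as is.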

Utility functions should go into any file in ./R/, e.g. funks.R. Write a simple executable Rscript (example ./exec/loadPcoccineusSeqSimSearchResults.R) that parses the Mercator4 results of P. coccineus and stores the above reference words in an RData file.
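For the storing step, a sketch of writing the word sets to an RData file and reading them back. The output path is a placeholder in a temporary directory; where the file should live in the package (e.g. a data directory) is an assumption to be decided, and the list content is just the toy example from above.

```r
# Toy reference word sets; in the real script these come from the parsed
# Mercator4 result.
pc.ref.words.mercator4 <- list(query.gene.id = c("alien", "oxidator", "phosphatase"))

# Placeholder output path (assumption); adjust to the package layout.
out.file <- file.path(tempdir(), "pc.ref.words.mercator4.RData")
save(pc.ref.words.mercator4, file = out.file)

# Load into a fresh environment to check the round trip.
check.env <- new.env()
loaded.names <- load(out.file, envir = check.env)
```

load() returns the names of the restored objects, which makes it easy to verify that the file contains what is expected.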

asishallab commented 3 years ago

Mercator 4 specialties

Ignore all annotations that are in root BIN 35, because they are just best Blast hit descriptions.

In annotations that are in root BIN 50, throw away the best Blast hit, which is marked with a leading &, e.g.

Enzyme classification.EC_1 oxidoreductases.EC_1.1 oxidoreductase acting on CH-OH group of donor(50.1.1 : 1064.8) & Probable mannitol dehydrogenase OS=Fragaria ananassa (sp|q9zrf1|mtdh_fraan : 478.0)

Delete everything from the & to the end of the string.

The rest should go into the reference word sets.
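A sketch of both rules on a toy table, in case it is useful. It assumes the parsed table has columns named BINCODE and DESCRIPTION with clean bin codes such as "50.1.1" (no surrounding quotes); the helper name cleanMercatorAnnotations and the first and third toy rows are invented. The second row is the example from above.

```r
# Hypothetical helper: drop root BIN 35 and trim the "& ..." best Blast hit
# from descriptions in root BIN 50.
cleanMercatorAnnotations <- function(m.tbl) {
  # Root bin is everything before the first "." of the bin code.
  root.bin <- sub("\\..*$", "", m.tbl$BINCODE)
  m.tbl <- m.tbl[!(root.bin %in% "35"), , drop = FALSE]
  in.bin.50 <- sub("\\..*$", "", m.tbl$BINCODE) %in% "50"
  # Delete the "&" and everything after it, plus white space before it.
  m.tbl$DESCRIPTION[in.bin.50] <- sub("\\s*&.*$", "", m.tbl$DESCRIPTION[in.bin.50],
                                      perl = TRUE)
  m.tbl
}

m.df <- data.frame(
  BINCODE = c("35.1", "50.1.1", "5.1"),
  DESCRIPTION = c(
    "not assigned.annotated",
    paste0("Enzyme classification.EC_1 oxidoreductases.EC_1.1 oxidoreductase ",
           "acting on CH-OH group of donor(50.1.1 : 1064.8) & Probable mannitol ",
           "dehydrogenase OS=Fragaria ananassa (sp|q9zrf1|mtdh_fraan : 478.0)"),
    "lipid metabolism"
  ),
  stringsAsFactors = FALSE
)
m.clean <- cleanMercatorAnnotations(m.df)
```

Comparing the root bin as a string, not by prefix, keeps bins like 5.1 from being mistaken for BIN 50.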

asishallab commented 3 years ago

PfamA specialties

Sift through a few example rows and consider splitting on just \\s+ instead of the default split-regex in function wordSet.

asishallab commented 2 years ago

Using annotations obtained from Mercator (MapMan Bin Ontology) and Pfam-A annotations as gold standard

usadellab / prot-scriber

Generate reference sets #2