manastech / cafa5


Analysis: get a list of candidate GO terms #28

Open leandroradusky opened 1 year ago

leandroradusky commented 1 year ago

Now that we have a method to compute candidate GO terms, we should investigate which protein-term pairs to make predictions for (the competition limit is 15k predictions, while the test set has >140k proteins and the number of GO terms is also in the tens of thousands).

For a first analysis, let's start with the direct child terms of those already assigned to the proteins in the test set. Each term has a score based on its rarity over the whole protein universe (Information Accretion; a full explanation of this term is linked here). Let's call this IA(term).
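A minimal sketch of the "direct child terms" idea, not the package's actual API. It assumes two hypothetical inputs: `assigned`, mapping each test protein to its already assigned GO terms, and `children`, mapping each GO term to its direct children in the ontology. Dropping children that are already assigned to the protein is also an assumption.

```python
from typing import Dict, Set


def direct_child_candidates(
    assigned: Dict[str, Set[str]],
    children: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per protein, the direct children of its assigned terms.

    Both input mappings are hypothetical stand-ins for whatever the
    package exposes. Children already assigned to the protein are
    excluded (an assumption: they are not new candidates).
    """
    candidates: Dict[str, Set[str]] = {}
    for protein, terms in assigned.items():
        found: Set[str] = set()
        for term in terms:
            # Terms without children (leaves) contribute nothing.
            found |= children.get(term, set())
        candidates[protein] = found - terms
    return candidates


# Toy data with made-up identifiers, for illustration only.
children = {"GO:A": {"GO:B", "GO:C"}, "GO:B": {"GO:D"}}
assigned = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:B"}}
candidates = direct_child_candidates(assigned, children)
```

With this toy data, P1 gets both children of GO:A as candidates, while P2 gets GO:C and GO:D, because GO:B is already assigned to it.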

We should create an analysis with the following steps:
1) Compute all the direct child GO terms over the test set of proteins, saving for each candidate term the number of proteins it is a candidate for (let's call this #proteins(term)).
2) Go naive: rank the terms to be predicted by #proteins(term) * IA(term).
3) Compute the GO term-protein pairs to be predicted, cutting off at the 15k-prediction limit.
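The ranking and cutoff steps above could be sketched roughly as follows. This is an illustration under assumptions, not the package's implementation: `candidates` (protein to candidate terms) and `ia` (term to IA score) are hypothetical inputs, terms missing from `ia` are scored as 0, and ties are broken alphabetically so the output is deterministic.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_terms(candidates: Dict[str, Set[str]], ia: Dict[str, float]) -> List[str]:
    """Rank candidate terms by #proteins(term) * IA(term), highest first."""
    # #proteins(term): number of proteins each term is a candidate for.
    counts = Counter(term for terms in candidates.values() for term in terms)
    # Assumption: a term with no known IA value gets a score of 0.
    scores = {term: n * ia.get(term, 0.0) for term, n in counts.items()}
    return sorted(scores, key=lambda term: (-scores[term], term))


def select_pairs(
    candidates: Dict[str, Set[str]],
    ia: Dict[str, float],
    limit: int = 15_000,
) -> List[Tuple[str, str]]:
    """Collect (protein, term) pairs in term-rank order, up to `limit` pairs."""
    pairs: List[Tuple[str, str]] = []
    for term in rank_terms(candidates, ia):
        for protein in sorted(candidates):
            if term in candidates[protein]:
                if len(pairs) >= limit:
                    return pairs
                pairs.append((protein, term))
    return pairs


# Toy data with made-up identifiers and IA values, for illustration only.
toy_candidates = {
    "P1": {"GO:B", "GO:C"},
    "P2": {"GO:C", "GO:D"},
    "P3": {"GO:C"},
}
toy_ia = {"GO:B": 5.0, "GO:C": 1.0, "GO:D": 2.0}
ranked = rank_terms(toy_candidates, toy_ia)
top_pairs = select_pairs(toy_candidates, toy_ia, limit=3)
```

Note that a hard cutoff like this can split a term's proteins across the limit: in the toy data, GO:C is a candidate for three proteins but only two of those pairs fit. Whether to keep or drop such partially covered terms is a decision the analysis would need to make.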

Analyses are usually done in Jupyter notebooks rather than scripts, since you can describe each step in markdown, plot things, etc., which will be useful for communicating our decisions on the way to the final predictions. GitHub displays notebooks well: it formats the markdown, renders the plots, etc. Let's put the generated notebook in a folder called analyses and have it "consume" the functionality of the package already developed, which also serves as a first example of its use.

nthiad commented 1 year ago

Partly added in #33, but the Jupyter notebook still needs to be written.