Create a combined score with only human evidence

dhimmel commented 4 years ago

For some applications, it might be nice to avoid any scores transferred from other species. Will look into creating a human-evidence-only combined score.

See this notebook for stats on the score distribution for each channel for human proteins. Note that the neighborhood and database_transferred channels are all zero for human proteins.

dhimmel commented 4 years ago

Background quotes

Quotes from:

STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, … Christian von Mering
Nucleic Acids Research (2018-11-22) https://doi.org/gfz2jr
DOI: 10.1093/nar/gky1131 · PMID: 30476243 · PMCID: PMC6323986

Within each channel, the evidence is further subdivided into two sub-scores, one of which represents evidence stemming from the organism itself, and the other represents evidence transferred from other organisms. For the latter transfer, the 'interolog' concept is applied (42,43); STRING uses hierarchically arranged orthologous group relations as defined in eggNOG (32), in order to transfer associations between organisms where applicable (described in (29)).

Quotes from https://string-db.org/help/faq/

The combined score is computed by combining the probabilities from the different evidence channels and corrected for the probability of randomly observing an interaction. For a more detailed description please see von Mering, et al. Nucleic Acids Res. 2005

From von Mering, et al. Nucleic Acids Res. 2005:

After assignment of association scores and transfer between species, we compute a final ‘combined score’ between any pair of proteins (or pair of COGs). This score is often higher than the individual sub-scores, expressing increased confidence when an association is supported by several types of evidence ( Table 1 ). It is computed under the assumption of independence for the various sources, in a naïve Bayesian fashion. It is thus a simple expression of the individual scores:

Also see the Python script combine_subscores.py.
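
The expression the 2005 quote refers to is the naive Bayes combination S = 1 - ∏(1 - S_i) over the individual scores S_i. Below is a minimal sketch of that combination, including the correction for the probability of a random interaction mentioned in the FAQ. It is not a copy of combine_subscores.py. The prior value of 0.041 is an assumption based on that script and should be checked against the STRING version in use. The function names and the example channel scores are hypothetical.

```python
from math import prod

# ASSUMPTION: 0.041 is the prior used in STRING's combine_subscores.py;
# verify it against the script for the STRING version you are using.
PRIOR = 0.041


def remove_prior(score, prior=PRIOR):
    """Strip the prior from one channel score (scores are on a 0-1 scale)."""
    return max(score - prior, 0.0) / (1.0 - prior)


def combine(scores, prior=PRIOR):
    """Naive Bayes combination, 1 - prod(1 - s_i), with the prior added back once."""
    no_prior = [remove_prior(s, prior) for s in scores]
    combined = 1.0 - prod(1.0 - s for s in no_prior)
    return combined * (1.0 - prior) + prior


# Hypothetical channel scores for one protein pair (file values divided by 1000).
direct = {"coexpression": 0.30, "experiments": 0.60, "textmining": 0.45}
transferred = {"experiments_transferred": 0.50, "textmining_transferred": 0.25}

all_evidence = combine([*direct.values(), *transferred.values()])
human_only = combine(list(direct.values()))
print(f"all evidence: {all_evidence:.3f}, human only: {human_only:.3f}")
```

Because every channel can only raise the product term, dropping the transferred channels can never increase the combined score for a pair.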

From FAQ "How to retrieve only the direct evidence in human, not transferred":

You need the file: protein.links.full.txt.gz, from which you can retrieve the columns like above and write it to a file.
zgrep ^"9606\." protein.links.full.txt.gz | awk '($16 > 700) { print $1, $2, $3, $5, $6, $7, $8, $10, $12, $14, $16 }' > PPI_700_human.txt
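
The same selection can be sketched in Python by column name rather than by position, which makes it easier to see which channels are kept. The column names below are an assumption about the header of protein.links.full.txt.gz (including the "cooccurence" spelling); if the header differs, `header.index` raises a ValueError instead of silently picking the wrong column. The function names are hypothetical.

```python
import gzip

# ASSUMPTION: these names match the header of protein.links.full.txt.gz and
# correspond to awk fields $1, $2, $3, $5, $6, $7, $8, $10, $12, $14, $16.
DIRECT_COLUMNS = [
    "protein1", "protein2", "neighborhood", "fusion", "cooccurence",
    "homology", "coexpression", "experiments", "database", "textmining",
    "combined_score",
]


def human_direct_rows(lines, taxon="9606", min_combined=700):
    """Yield the direct-evidence columns for one taxon above a score cutoff."""
    lines = iter(lines)
    header = next(lines).split()
    index = [header.index(name) for name in DIRECT_COLUMNS]
    combined_at = header.index("combined_score")
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].startswith(taxon + "."):
            continue
        if int(fields[combined_at]) > min_combined:
            yield [fields[i] for i in index]


def read_human_direct(path, **kwargs):
    """Stream a gzipped links file from disk; the path is supplied by the caller."""
    with gzip.open(path, "rt") as handle:
        yield from human_direct_rows(handle, **kwargs)
```

Note that the cutoff is still applied to the original combined_score column, which includes transferred evidence, so this filters pairs rather than producing a human-only score.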

Homology correction described in 2008 blog post:

In order to avoid that gene duplications lead spurious functional associations, homologous proteins are down-weighed in the co-occurrence and text-mining channels.
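
A minimal sketch of what such down-weighting could look like. The multiplicative form, score * (1 - homology), is an assumption modeled on combine_subscores.py rather than something stated in the blog post, and the function name is hypothetical. Both inputs are taken to be on a 0-1 scale.

```python
def homology_corrected(score, homology):
    """Down-weight a channel score by the pair's homology score (both 0-1).

    ASSUMPTION: multiplicative correction; in a full pipeline this would be
    applied to the co-occurrence and text-mining scores only.
    """
    return score * (1.0 - homology)


# Hypothetical pair: strong text-mining evidence, moderate sequence homology.
print(homology_corrected(0.8, 0.25))
```

A pair with no detectable homology keeps its score unchanged, while a pair with a homology score of 1 contributes nothing from these two channels.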

dhimmel commented 4 years ago

Study that addresses why one might want to exclude transferred interactions:

What Evidence Is There for the Homology of Protein-Protein Interactions?
Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane
PLoS Computational Biology (2012-09-20) https://doi.org/ggh6zz
DOI: 10.1371/journal.pcbi.1002645 · PMID: 23028270 · PMCID: PMC3447968

Main takeaway:

Our results imply that, unless using strict definitions of homology, interactions rewire at a rate too fast to allow reliable transfer across species.

dhimmel commented 4 years ago

Another question is whether to include the "genomic context prediction" channels. From the v11.0 paper:

The three genomic context prediction channels (neighborhood, fusion, gene co-occurrence) are the result of systematic all-against-all genome comparisons, aiming to assess the consequences of past genome rearrangements, gene gains and losses, as well as gene fusion events. These evolutionary events are known to be retained non-randomly with respect to the functional roles of genes, and thus allow the inference of functional associations between genes even for otherwise rarely studied organisms (genomic context techniques are reviewed in (44,45)).

dhimmel commented 4 years ago

See the 05.combine-subscores.ipynb notebook. It does look like excluding the evidence transferred from other species will cause a widespread drop in scores. Whether this is beneficial for a given application is another matter.

related-sciences / string-protein-network

Create a combined score with only human evidence #2
