Closed DSuveges closed 3 years ago
Let's take a look at the ANGPTL3 target page. Under Molecular interactions, in the STRING tab, there is a list of 681 pairwise interactions. The list includes an interaction with ANGPTL4. The underlying data fetched from GraphQL looks like this:
{
"intA": "ENSP00000360170",
"intB": "ENSP00000301455",
"targetB": {
"approvedSymbol": "ANGPTL4",
"id": "ENSG00000167772",
"__typename": "Target"
},
"scoring": 0.831,
"evidences": [
{
"evidenceScore": 0.17200000000000001,
"interactionDetectionMethodShortName": "textmining",
"__typename": "InteractionEvidence"
},
{
"evidenceScore": 0.061,
"interactionDetectionMethodShortName": "coexpression",
"__typename": "InteractionEvidence"
},
{
"evidenceScore": 0.8,
"interactionDetectionMethodShortName": "database",
"__typename": "InteractionEvidence"
}
],
"__typename": "Interaction"
}
The overall score is 0.83 and the interaction is supported by text mining (0.172), coexpression (0.061) and database (0.8). Based on the table, there is no evidence from homology or the other detection methods.
However, checking the same gene, ANGPTL3, on the STRING website (see the Legend tab), we see there is evidence for homology with a score of 0.809.
So what we are missing from the evidences array is another InteractionEvidence object looking like this:
{
"evidenceScore": 0.809,
"interactionDetectionMethodShortName": "homology",
"__typename": "InteractionEvidence"
},
This value is expected to be fetched from the 9606.protein.links.full.v11.0.txt.gz file, e.g.:
cat <(gzcat 9606.protein.links.full.v11.0.txt.gz | head -n1) \
<(gzcat 9606.protein.links.full.v11.0.txt.gz | grep "9606.ENSP00000360170 9606.ENSP00000301455" | head -n1) | \
column -t
Giving this:
protein1              protein2              neighborhood  neighborhood_transferred  fusion  cooccurence  homology  coexpression  coexpression_transferred  experiments  experiments_transferred  database  database_transferred  textmining  textmining_transferred  combined_score
9606.ENSP00000360170  9606.ENSP00000301455  0             0                         0       0            809       0             61                        0            0                        800       0                     731         43                      831
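To make the link between this row and the missing GraphQL object explicit, here is a minimal Python sketch. It assumes the API score is simply the file value divided by 1000 (consistent with 831 and 0.831 above); the helper name `to_evidence` is hypothetical and not part of any existing code.

```python
# Header and row copied from the output of the command above.
header = (
    "protein1 protein2 neighborhood neighborhood_transferred fusion "
    "cooccurence homology coexpression coexpression_transferred "
    "experiments experiments_transferred database database_transferred "
    "textmining textmining_transferred combined_score"
).split()
values = (
    "9606.ENSP00000360170 9606.ENSP00000301455 "
    "0 0 0 0 809 0 61 0 0 800 0 731 43 831"
).split()

row = dict(zip(header, values))


def to_evidence(row, method):
    """Hypothetical helper: build an evidence entry for one detection method.

    Assumes file scores are scaled by 1000 relative to the API scores.
    """
    return {
        "evidenceScore": int(row[method]) / 1000,
        "interactionDetectionMethodShortName": method,
    }


missing = to_evidence(row, "homology")
print(missing)
```

The printed dictionary carries the 0.809 homology score that the evidences array currently lacks.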
As an interim solution the input file can be patched:
paste -d" " <(gzcat 9606.protein.links.detailed.v11.0.txt.gz ) \
<(gzcat 9606.protein.links.full.v11.0.txt.gz | cut -f7 -d " ") \
| gzip > 9606.protein.links.full_w_homology.v11.0.txt.gz
Generating a file like this:
protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score homology
9606.ENSP00000000233 9606.ENSP00000272298 0 0 332 62 181 0 125 490 0
9606.ENSP00000000233 9606.ENSP00000253401 0 0 0 0 186 0 56 198 0
9606.ENSP00000000233 9606.ENSP00000401445 0 0 0 0 159 0 0 159 0
9606.ENSP00000000233 9606.ENSP00000418915 0 0 0 61 158 0 542 606 0
9606.ENSP00000000233 9606.ENSP00000327801 0 0 0 88 78 0 89 167 0
9606.ENSP00000000233 9606.ENSP00000466298 0 0 0 141 131 0 98 267 0
9606.ENSP00000000233 9606.ENSP00000232564 0 0 0 62 171 0 56 201 0
9606.ENSP00000000233 9606.ENSP00000393379 0 0 0 61 131 0 43 150 0
9606.ENSP00000000233 9606.ENSP00000371253 0 0 0 61 0 0 224 240 0
In this file, for the interaction between ANGPTL3 and ANGPTL4 listed above, we can see the evidence by homology with the expected value:
protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score homology
9606.ENSP00000301455 9606.ENSP00000360170 0 0 0 61 0 800 172 831 809
9606.ENSP00000360170 9606.ENSP00000301455 0 0 0 61 0 800 172 831 809
The file is uploaded here:
gs://ot-team/dsuveges/interactions/9606.protein.links.full_w_homology.v11.0.txt.gz
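The `paste` command above works only if both files list the interactions in exactly the same order. A more defensive variant joins on the protein pair instead. The following is a rough sketch under that assumption; the function name `join_homology` is hypothetical, and it holds all homology values in memory, which may be heavy for the full human file.

```python
from typing import Dict, Iterable, Iterator, Tuple


def join_homology(detailed: Iterable[str], full: Iterable[str]) -> Iterator[str]:
    """Append the homology column of the 'full' file to the 'detailed' file.

    Rows are matched on (protein1, protein2) rather than on line position.
    A pair missing from the full file raises KeyError, so a mismatch
    between the two files fails loudly instead of silently shifting values.
    """
    full_iter = iter(full)
    header = next(full_iter).split()
    p1, p2, hom = (header.index(c) for c in ("protein1", "protein2", "homology"))

    homology: Dict[Tuple[str, str], str] = {}
    for line in full_iter:
        fields = line.split()
        homology[(fields[p1], fields[p2])] = fields[hom]

    detailed_iter = iter(detailed)
    yield next(detailed_iter).rstrip("\n") + " homology"
    for line in detailed_iter:
        fields = line.split()
        yield line.rstrip("\n") + " " + homology[(fields[0], fields[1])]


# Small in-memory example using the ANGPTL3/ANGPTL4 rows shown above.
detailed_lines = [
    "protein1 protein2 neighborhood fusion cooccurence coexpression "
    "experimental database textmining combined_score",
    "9606.ENSP00000360170 9606.ENSP00000301455 0 0 0 61 0 800 172 831",
]
full_lines = [
    "protein1 protein2 neighborhood neighborhood_transferred fusion "
    "cooccurence homology coexpression coexpression_transferred experiments "
    "experiments_transferred database database_transferred textmining "
    "textmining_transferred combined_score",
    "9606.ENSP00000360170 9606.ENSP00000301455 0 0 0 0 809 0 61 0 0 800 0 731 43 831",
]
joined = list(join_homology(detailed_lines, full_lines))
print("\n".join(joined))
```

For the real files, the two arguments could be file objects opened with `gzip.open(path, "rt")`, since the function only needs iterables of text lines.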
Moved the code for processing STRING protein data into the ETL and added homology:
https://github.com/opentargets/platform-etl-backend/pull/115
We use 9606.protein.links.full_w_homology.v11.0.txt.gz for the current release
path = "gs://open-targets-data-releases/21.02/input/annotation_files/interactions/9606.protein.links.full_w_homology.v11.0.txt"
The current release of the STRING interaction data lacks homology information. It's missing because the currently used file doesn't have a homology column. However, the more complete file (https://stringdb-static.org/download/protein.links.full.v11.0/9606.protein.links.full.v11.0.txt.gz), which has the homology information, does not have the pooled data listed above. The columns in the second file need to be summarized to get the values in the first file. This needs to be implemented in the ingest script in platform input support.
As an interim solution, the homology column can simply be joined into the first file.
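For the longer-term fix, the summarization of the full file into pooled channel scores might look like the sketch below. It is not the ingest script: the prior of 0.041, the homology correction applied to the cooccurence and textmining channels, and the truncation to integers are assumptions based on how STRING describes combining subscores. The function names are hypothetical. The sketch reproduces the pooled values of the ANGPTL3/ANGPTL4 row shown earlier, but it has not been checked against the rest of the file.

```python
import math

PRIOR = 0.041  # assumption: prior probability used by STRING


def remove_prior(score_milli: int) -> float:
    """Convert a 0-1000 score to a probability with the prior removed."""
    score = max(score_milli / 1000, PRIOR)
    return (score - PRIOR) / (1 - PRIOR)


def add_prior(prob: float) -> int:
    """Put the prior back once and truncate to a 0-1000 score."""
    if prob <= 0:
        return 0  # channels without any evidence stay at 0
    # The small epsilon guards against float error just below an integer.
    return math.floor((prob * (1 - PRIOR) + PRIOR) * 1000 + 1e-6)


def summarise(row: dict) -> dict:
    """Pool direct and transferred columns of one 'full' row."""
    hom = row["homology"] / 1000

    def pool(*cols: str, corrected: bool = False) -> float:
        miss = 1.0
        for col in cols:
            miss *= 1 - remove_prior(row[col])
        prob = 1 - miss
        # Assumed homology correction for cooccurence and textmining.
        return prob * (1 - hom) if corrected else prob

    channels = {
        "neighborhood": pool("neighborhood", "neighborhood_transferred"),
        "fusion": pool("fusion"),
        "cooccurence": pool("cooccurence", corrected=True),
        "coexpression": pool("coexpression", "coexpression_transferred"),
        "experimental": pool("experiments", "experiments_transferred"),
        "database": pool("database", "database_transferred"),
        "textmining": pool("textmining", "textmining_transferred", corrected=True),
    }
    miss = 1.0
    for prob in channels.values():
        miss *= 1 - prob
    pooled = {name: add_prior(prob) for name, prob in channels.items()}
    pooled["combined_score"] = add_prior(1 - miss)
    return pooled


# The ANGPTL3/ANGPTL4 row from the full file.
full_row = {
    "neighborhood": 0, "neighborhood_transferred": 0, "fusion": 0,
    "cooccurence": 0, "homology": 809, "coexpression": 0,
    "coexpression_transferred": 61, "experiments": 0,
    "experiments_transferred": 0, "database": 800, "database_transferred": 0,
    "textmining": 731, "textmining_transferred": 43,
}
pooled_row = summarise(full_row)
print(pooled_row)
```

For this row the sketch yields coexpression 61, database 800, textmining 172 and combined_score 831, matching the pooled file. That also accounts for textmining being 731 in the full file but 172 in the pooled one.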