opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

STRING datafile needs homology information #1463

Closed DSuveges closed 3 years ago

DSuveges commented 3 years ago

The current release of the STRING interaction data lacks homology information. It is missing because the file currently used doesn't have a homology column:

https://stringdb-static.org/download/protein.links.detailed.v11.0/9606.protein.links.detailed.v11.0.txt.gz
protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score
9606.ENSP00000000233  9606.ENSP00000272298  0             0       332          62            181           0         125         490
9606.ENSP00000000233  9606.ENSP00000253401  0             0       0            0             186           0         56          198
9606.ENSP00000000233  9606.ENSP00000401445  0             0       0            0             159           0         0           159
9606.ENSP00000000233  9606.ENSP00000418915  0             0       0            61            158           0         542         606
9606.ENSP00000000233  9606.ENSP00000327801  0             0       0            88            78            0         89          167
9606.ENSP00000000233  9606.ENSP00000466298  0             0       0            141           131           0         98          267
9606.ENSP00000000233  9606.ENSP00000232564  0             0       0            62            171           0         56          201
9606.ENSP00000000233  9606.ENSP00000393379  0             0       0            61            131           0         43          150
9606.ENSP00000000233  9606.ENSP00000371253  0             0       0            61            0             0         224         240

However, the more complete file (https://stringdb-static.org/download/protein.links.full.v11.0/9606.protein.links.full.v11.0.txt.gz), which has the homology information, does not have the pooled columns listed above:

protein1              protein2              neighborhood  neighborhood_transferred  fusion  cooccurence  homology  coexpression  coexpression_transferred  experiments  experiments_transferred  database  database_transferred  textmining  textmining_transferred  combined_score
9606.ENSP00000000233  9606.ENSP00000272298  0             0                         0       332          0         0             62                        0            181                      0         0                     0           125                     490
9606.ENSP00000000233  9606.ENSP00000253401  0             0                         0       0            0         0             0                         0            186                      0         0                     0           56                      198

The direct and _transferred columns in the second file need to be summarized to obtain the values in the first file. This needs to be implemented in the ingest script in platform input support.

As an interim solution, the homology column can simply be joined onto the first file.
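A minimal sketch of how the summarizing step could look, for discussion only. The thread does not say how STRING pools the direct and _transferred columns. The sketch assumes a prior of 0.041 that is removed before pooling and added back once, a homology damping applied to cooccurence and textmining, and truncation to the 0-1000 scale. These assumptions were chosen because they reproduce the rows quoted in this issue; they are not confirmed by it. The names PRIOR, remove_prior, pool, file_score and summarize are hypothetical.

```python
from fractions import Fraction
from math import floor

# Assumption: prior probability of a random pair interacting (not stated in this thread).
PRIOR = Fraction(41, 1000)

# Score columns of the "full" file, in file order (after protein1 and protein2).
FULL_COLUMNS = (
    "neighborhood neighborhood_transferred fusion cooccurence homology "
    "coexpression coexpression_transferred experiments experiments_transferred "
    "database database_transferred textmining textmining_transferred combined_score"
).split()


def remove_prior(score: int) -> Fraction:
    """Turn a 0-1000 file score into a prior-free probability."""
    s = max(Fraction(score, 1000), PRIOR)
    return (s - PRIOR) / (1 - PRIOR)


def pool(direct: int, transferred: int, homology: int = 0) -> Fraction:
    """Pool a direct and a transferred column, optionally damped by homology."""
    both = 1 - (1 - remove_prior(direct)) * (1 - remove_prior(transferred))
    return both * (1 - Fraction(homology, 1000))


def file_score(p: Fraction) -> int:
    """Add the prior back and truncate to 0-1000; channels without evidence stay 0."""
    return floor((p * (1 - PRIOR) + PRIOR) * 1000) if p > 0 else 0


def summarize(line: str) -> dict:
    """Collapse one space-delimited line of the full file into the detailed-file columns."""
    row = dict(zip(FULL_COLUMNS, map(int, line.split()[2:])))
    h = row["homology"]
    pooled = {
        "neighborhood": pool(row["neighborhood"], row["neighborhood_transferred"]),
        "fusion": pool(row["fusion"], 0),
        "cooccurence": pool(row["cooccurence"], 0, h),
        "coexpression": pool(row["coexpression"], row["coexpression_transferred"]),
        "experimental": pool(row["experiments"], row["experiments_transferred"]),
        "database": pool(row["database"], row["database_transferred"]),
        "textmining": pool(row["textmining"], row["textmining_transferred"], h),
    }
    # The combined score uses the untruncated channel values.
    miss = Fraction(1)
    for p in pooled.values():
        miss *= 1 - p
    out = {name: file_score(p) for name, p in pooled.items()}
    out["combined_score"] = file_score(1 - miss)
    return out


print(summarize(
    "9606.ENSP00000000233 9606.ENSP00000272298 0 0 0 332 0 0 62 0 181 0 0 0 125 490"
))
```

Exact fractions are used instead of floats so that the final truncation is not affected by rounding noise (for example a value that should be exactly 62 turning into 61.999...).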

DSuveges commented 3 years ago

Let's take a look at the ANGPTL3 target page. Under molecular interactions, in the STRING tab, there is a list of 681 pairwise interactions. The list includes an interaction with ANGPTL4. The underlying data fetched from GraphQL looks like this:

{
  "intA": "ENSP00000360170",
  "intB": "ENSP00000301455",
  "targetB": {
    "approvedSymbol": "ANGPTL4",
    "id": "ENSG00000167772",
    "__typename": "Target"
  },
  "scoring": 0.831,
  "evidences": [
    {
      "evidenceScore": 0.17200000000000001,
      "interactionDetectionMethodShortName": "textmining",
      "__typename": "InteractionEvidence"
    },
    {
      "evidenceScore": 0.061,
      "interactionDetectionMethodShortName": "coexpression",
      "__typename": "InteractionEvidence"
    },
    {
      "evidenceScore": 0.8,
      "interactionDetectionMethodShortName": "database",
      "__typename": "InteractionEvidence"
    }
  ],
  "__typename": "Interaction"
}

The overall score is 0.83 and the interaction is supported by text mining (0.172), coexpression (0.061) and database (0.8). According to the table, there is no evidence from homology or from any of the other detection methods.

However, checking the same gene, ANGPTL3, on the STRING website (see the Legend tab), we see there is evidence for homology with a score of 0.809.

So what is missing from the evidences array is another InteractionEvidence object, looking like this:

{
  "evidenceScore": 0.809,
  "interactionDetectionMethodShortName": "homology",
  "__typename": "InteractionEvidence"
}

This value is expected to be fetched from the 9606.protein.links.full.v11.0.txt.gz file, e.g.:

cat <(gzcat 9606.protein.links.full.v11.0.txt.gz | head -n1) \
      <(gzcat 9606.protein.links.full.v11.0.txt.gz | grep "9606.ENSP00000360170 9606.ENSP00000301455" | head -n1)  | \
      column -t

Giving this:

protein1              protein2              neighborhood  neighborhood_transferred  fusion  cooccurence  homology  coexpression  coexpression_transferred  experiments  experiments_transferred  database  database_transferred  textmining  textmining_transferred  combined_score
9606.ENSP00000360170  9606.ENSP00000301455  0             0                         0       0            809       0             61                        0            0                        800       0                     731         43                      831
DSuveges commented 3 years ago

As an interim solution, the input file can be patched:

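# Append column 7 (homology) of the full file to the detailed file.
# Assumes both files list the protein pairs in the same row order.
paste -d" " <(gzcat 9606.protein.links.detailed.v11.0.txt.gz) \
     <(gzcat 9606.protein.links.full.v11.0.txt.gz | cut -f7 -d " ") \
     | gzip > 9606.protein.links.full_w_homology.v11.0.txt.gz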

Generating a file like this:

protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology
9606.ENSP00000000233  9606.ENSP00000272298  0             0       332          62            181           0         125         490             0
9606.ENSP00000000233  9606.ENSP00000253401  0             0       0            0             186           0         56          198             0
9606.ENSP00000000233  9606.ENSP00000401445  0             0       0            0             159           0         0           159             0
9606.ENSP00000000233  9606.ENSP00000418915  0             0       0            61            158           0         542         606             0
9606.ENSP00000000233  9606.ENSP00000327801  0             0       0            88            78            0         89          167             0
9606.ENSP00000000233  9606.ENSP00000466298  0             0       0            141           131           0         98          267             0
9606.ENSP00000000233  9606.ENSP00000232564  0             0       0            62            171           0         56          201             0
9606.ENSP00000000233  9606.ENSP00000393379  0             0       0            61            131           0         43          150             0
9606.ENSP00000000233  9606.ENSP00000371253  0             0       0            61            0             0         224         240             0
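The paste approach only gives correct results if both files list the protein pairs in exactly the same order. If that cannot be guaranteed, the homology value could be looked up by the (protein1, protein2) key instead. A minimal sketch, assuming both files are single-space delimited as the paste and cut commands above imply; add_homology is a hypothetical helper, not part of any existing script, and the two in-memory samples stand in for the real files.

```python
import csv
import io


def add_homology(detailed, full):
    """Yield the rows of `detailed` with the homology value of the same pair from `full`."""
    homology = {
        (r["protein1"], r["protein2"]): r["homology"]
        for r in csv.DictReader(full, delimiter=" ")
    }
    reader = csv.DictReader(detailed, delimiter=" ")
    yield reader.fieldnames + ["homology"]
    for r in reader:
        # A pair missing from the full file raises KeyError instead of being patched silently.
        key = (r["protein1"], r["protein2"])
        yield [r[name] for name in reader.fieldnames] + [homology[key]]


# Small samples built from rows quoted in this thread; the full file is deliberately reordered.
detailed = io.StringIO(
    "protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score\n"
    "9606.ENSP00000000233 9606.ENSP00000272298 0 0 332 62 181 0 125 490\n"
    "9606.ENSP00000360170 9606.ENSP00000301455 0 0 0 61 0 800 172 831\n"
)
full = io.StringIO(
    "protein1 protein2 neighborhood neighborhood_transferred fusion cooccurence homology coexpression "
    "coexpression_transferred experiments experiments_transferred database database_transferred "
    "textmining textmining_transferred combined_score\n"
    "9606.ENSP00000360170 9606.ENSP00000301455 0 0 0 0 809 0 61 0 0 800 0 731 43 831\n"
    "9606.ENSP00000000233 9606.ENSP00000272298 0 0 0 332 0 0 62 0 181 0 0 0 125 490\n"
)
patched = list(add_homology(detailed, full))
for row in patched:
    print(" ".join(row))
```

For the real files, the two arguments could be opened with gzip.open(path, "rt"). The lookup table holds one entry per pair of the full file, so it trades memory for not depending on row order.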

In this file, for the interaction between ANGPTL3 and ANGPTL4 discussed above, we can see the homology evidence with the expected value:

protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology
9606.ENSP00000301455  9606.ENSP00000360170  0             0       0            61            0             800       172         831             809
9606.ENSP00000360170  9606.ENSP00000301455  0             0       0            61            0             800       172         831             809

The file is uploaded here: gs://ot-team/dsuveges/interactions/9606.protein.links.full_w_homology.v11.0.txt.gz
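To illustrate what the patched row should turn into on the API side, here is a rough sketch that maps one line of the patched file to the shape of the GraphQL object shown earlier. The field names are taken from that JSON. The mapping itself (dividing by 1000, dropping channels with a score of 0, stripping the 9606. prefix) is inferred from the example and is an assumption, not the actual ETL logic. CHANNELS and to_interaction are hypothetical names.

```python
# Evidence channels in the patched file, in column order.
CHANNELS = [
    "neighborhood", "fusion", "cooccurence", "coexpression",
    "experimental", "database", "textmining", "homology",
]


def to_interaction(header: str, line: str) -> dict:
    """Map one whitespace-delimited row of the patched file to an interaction object."""
    row = dict(zip(header.split(), line.split()))
    evidences = [
        {
            "interactionDetectionMethodShortName": name,
            "evidenceScore": int(row[name]) / 1000,
        }
        for name in CHANNELS
        if int(row[name]) > 0  # assumption: channels without evidence are omitted
    ]
    return {
        "intA": row["protein1"].split(".", 1)[1],  # drop the taxon prefix "9606."
        "intB": row["protein2"].split(".", 1)[1],
        "scoring": int(row["combined_score"]) / 1000,
        "evidences": evidences,
    }


header = (
    "protein1 protein2 neighborhood fusion cooccurence coexpression "
    "experimental database textmining combined_score homology"
)
line = "9606.ENSP00000360170 9606.ENSP00000301455 0 0 0 61 0 800 172 831 809"
interaction = to_interaction(header, line)
print(interaction)
```

With the homology column present, the evidences list for ANGPTL3 and ANGPTL4 gains the homology entry that is currently missing.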

cmalangone commented 3 years ago

Moved the code for processing STRING protein data into the ETL. Added homology.

https://github.com/opentargets/platform-etl-backend/pull/115

We use 9606.protein.links.full_w_homology.v11.0.txt.gz for the current release:

path = "gs://open-targets-data-releases/21.02/input/annotation_files/interactions/9606.protein.links.full_w_homology.v11.0.txt"