Updating interaction dataset imported from STRING

DSuveges commented 1 year ago

It has been advertised on string-db that the new v12.0 version of the resource is available. We need to look into how this update could be automated and

Tasks

[x] Download datasets and explore changes.
[x] Identify relevant columns. (The current version on the platform, v11.5, seems to show data that is inconsistent with string-db.)
[x] Update script to collect these columns into one single dataset. (This logic should either live in the ETL or be maintained by the data team.)
[x] Update documentation with the new citation (pmid: 36370105).
[x] Update table header v11 -> v12.0. (This info has been off on our website forever.)

Acceptance tests

There's a set of steps to complete to close this ticket:

Ensure the ingested dataset is consistent with the data shown on string-db, which indicates the data is ready to be released.
Then update documentation with the new reference.
Then update table header.

DSuveges commented 1 year ago

Exploring the STRING website as of 18 July 2023, the new version is available via the UI, the API and the downloadable files. However, the version endpoint on both the old and the new API URLs returns the same data (v.11.5) [1]. This is not consistent with the other endpoints [2], where the returned data differs between the two.

URLs to get the version:

New: https://version-12-0.string-db.org/api/json/version
Old: https://string-db.org/api/json/version

Both return:

[
{
"string_version": "11.5",
"stable_address": "https://version-11-5.string-db.org"
}
]
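
A minimal sketch of how this version check could be scripted, assuming the endpoint keeps returning a JSON list with a `string_version` field as in the response above. The helper name `reported_version` is hypothetical, and the payload is passed in as text so the example does not depend on a network call:

```python
import json


def reported_version(payload: str) -> str:
    """Extract `string_version` from the JSON text of the version endpoint."""
    records = json.loads(payload)
    return records[0]["string_version"]


# Payload copied from the response shown above.
payload = (
    '[{"string_version": "11.5", '
    '"stable_address": "https://version-11-5.string-db.org"}]'
)

expected = "12.0"  # the release we expect the new URL to report
actual = reported_version(payload)
if actual != expected:
    print(f"Version mismatch: expected {expected}, endpoint reports {actual}")
```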

Top interaction partners for TP53:

Old: https://string-db.org/api/json/interaction_partners?identifiers=TP53&limit=100
New: https://version-12-0.string-db.org/api/json/interaction_partners?identifiers=TP53&limit=100
The first 100 interactions show 60% overlap, and even the shared interactions differ in their overall association score.
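
The comparison can be sketched roughly as follows. This is not the code that was actually used: the function names are hypothetical, the records are made-up toy data, and it assumes each record carries `stringId_B` and `score` fields with the query protein reported as `stringId_A`:

```python
def partner_scores(records):
    """Map each partner (stringId_B) to its overall association score."""
    return {r["stringId_B"]: r["score"] for r in records}


def compare_partners(old_records, new_records):
    """Return the share of old partners kept, and the shared partners whose score changed."""
    old, new = partner_scores(old_records), partner_scores(new_records)
    shared = old.keys() & new.keys()
    overlap = len(shared) / max(len(old), 1)
    changed = {p: (old[p], new[p]) for p in shared if old[p] != new[p]}
    return overlap, changed


# Toy records with made-up identifiers and scores, for illustration only.
old_records = [
    {"stringId_A": "P0", "stringId_B": "P1", "score": 0.999},
    {"stringId_A": "P0", "stringId_B": "P2", "score": 0.995},
    {"stringId_A": "P0", "stringId_B": "P3", "score": 0.990},
]
new_records = [
    {"stringId_A": "P0", "stringId_B": "P1", "score": 0.999},
    {"stringId_A": "P0", "stringId_B": "P2", "score": 0.999},
    {"stringId_A": "P0", "stringId_B": "P4", "score": 0.999},
]

overlap, changed = compare_partners(old_records, new_records)
print(f"overlap: {overlap:.0%}, partners with changed score: {sorted(changed)}")
```

In practice the two record lists would be the parsed JSON bodies of the old and new `interaction_partners` URLs.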

I have let the STRING team know about this inconsistency and am waiting for their reply.

DSuveges commented 1 year ago

What is slightly more concerning is that the data currently shown on the platform is consistent with neither the new nor the old STRING data:

+--------------------+--------------------+---------+---------+--------------+
|          stringId_A|          stringId_B|score_old|score_new|score_platform|
+--------------------+--------------------+---------+---------+--------------+
|9606.ENSP00000269305|9606.ENSP00000340989|    0.999|    0.999|         0.986|
|9606.ENSP00000269305|9606.ENSP00000263253|    0.999|    0.999|         0.999|
|9606.ENSP00000269305|9606.ENSP00000437955|    0.999|    0.999|         0.977|
|9606.ENSP00000269305|9606.ENSP00000362649|    0.999|    0.999|          0.99|
|9606.ENSP00000269305|9606.ENSP00000335153|    0.999|    0.999|          0.99|
|9606.ENSP00000269305|9606.ENSP00000278616|    0.999|    0.999|         0.999|
|9606.ENSP00000269305|9606.ENSP00000356150|    0.999|    0.999|         0.999|
|9606.ENSP00000269305|9606.ENSP00000372023|    0.998|    0.999|         0.998|
|9606.ENSP00000269305|9606.ENSP00000381185|    0.995|    0.999|         0.984|
|9606.ENSP00000269305|9606.ENSP00000365230|      NaN|    0.999|           NaN|
|9606.ENSP00000269305|9606.ENSP00000354218|      NaN|    0.999|           NaN|
|9606.ENSP00000269305|9606.ENSP00000266000|      NaN|    0.999|           NaN|
|9606.ENSP00000269305|9606.ENSP00000212015|    0.999|    0.999|         0.996|
|9606.ENSP00000269305|9606.ENSP00000418960|    0.999|    0.999|         0.996|
|9606.ENSP00000269305|9606.ENSP00000384849|    0.999|    0.999|         0.999|
|9606.ENSP00000269305|9606.ENSP00000341957|    0.999|    0.999|         0.998|
|9606.ENSP00000269305|9606.ENSP00000497594|      NaN|    0.999|           NaN|
|9606.ENSP00000269305|9606.ENSP00000343535|    0.999|    0.999|         0.987|
|9606.ENSP00000269305|9606.ENSP00000262367|    0.999|    0.999|         0.999|
|9606.ENSP00000269305|9606.ENSP00000258149|    0.999|    0.999|         0.999|
+--------------------+--------------------+---------+---------+--------------+
only showing top 20 rows

DSuveges commented 1 year ago

Looking at the old ticket about the STRING update (#1509), it is apparent that the dataset was not updated at that time:

detailed_url = 'https://stringdb-static.org/download/protein.links.detailed.v11.0/9606.protein.links.detailed.v11.0.txt.gz'
full_url = 'https://stringdb-static.org/download/protein.links.full.v11.0/9606.protein.links.full.v11.0.txt.gz'

# Joining the two datasets:
merged_df = detailed_df.merge(full_df, on=['protein1', 'protein2'], how='left')

# Saving data:
merged_df.to_csv('9606.protein.links.full_w_homology.v11.5.txt.gz', sep=' ', index=False, compression='infer')

So, somehow v.11.0 became v.11.5. I could confirm this by working with the actual v.11.5 data. As I could reproduce the API response, I could also confirm that the logic is sound, so we can assume it will work fine once the new v.12.0 dataset is plugged in.

DSuveges commented 1 year ago

The new dataset for STRING v.12.0 is available here: gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz

The schema is the same as the old file:

╰─ gsutil cat  gs://ot-team/dsuveges/9606.protein.links.full_w_homology.v12.0.txt.gz | gzcat | head -n5 | column -t
protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology
9606.ENSP00000000233  9606.ENSP00000356607  0             0       0            45            134           0         0           173             0
9606.ENSP00000000233  9606.ENSP00000427567  0             0       0            0             128           0         0           154             0
9606.ENSP00000000233  9606.ENSP00000253413  0             0       0            118           49            0         0           151             0
9606.ENSP00000000233  9606.ENSP00000493357  0             0       0            56            53            0         433         471             0

vs. v11.5:

╰─ gsutil cat gs://open-targets-data-releases/22.11/input/interactions-inputs/9606.protein.links.full_w_homology.v11.5.txt.gz | gzcat | head -n5 | column -t
protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology
9606.ENSP00000000233  9606.ENSP00000272298  0             0       332          62            181           0         125         490             0
9606.ENSP00000000233  9606.ENSP00000253401  0             0       0            0             186           0         56          198             0
9606.ENSP00000000233  9606.ENSP00000401445  0             0       0            0             159           0         0           159             0
9606.ENSP00000000233  9606.ENSP00000418915  0             0       0            61            158           0         542         606             0

The content of the file is consistent with the STRING website and API:

protein1              protein2              neighborhood  fusion  cooccurence  coexpression  experimental  database  textmining  combined_score  homology
9606.ENSP00000269305  9606.ENSP00000340989  0             0       0            0             981           750       859         999             0

v12.0 API response:

{
    "stringId_A": "9606.ENSP00000269305",
    "stringId_B": "9606.ENSP00000340989",
    "preferredName_A": "TP53",
    "preferredName_B": "SFN",
    "ncbiTaxonId": 9606,
    "score": 0.999,
    "nscore": 0,
    "fscore": 0,
    "pscore": 0,
    "ascore": 0,
    "escore": 0.981,
    "dscore": 0.75,
    "tscore": 0.859
},
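
A minimal sketch of this consistency check, assuming the file stores each score as an integer scaled by 1000. The example row confirms the mapping for `experimental`, `database`, `textmining` and `combined_score`. The mapping of the remaining (zero-valued) channels to `nscore`, `fscore`, `pscore` and `ascore` is an assumption based on the field names, and `row_to_api_scores` is a hypothetical helper:

```python
# File column -> API field. The first four entries are assumptions (see above).
COLUMN_TO_API_FIELD = {
    "neighborhood": "nscore",
    "fusion": "fscore",
    "cooccurence": "pscore",
    "coexpression": "ascore",
    "experimental": "escore",
    "database": "dscore",
    "textmining": "tscore",
    "combined_score": "score",
}


def row_to_api_scores(header: str, line: str) -> dict:
    """Convert one whitespace-separated file row into API-style scores (0 to 1)."""
    row = dict(zip(header.split(), line.split()))
    return {api: int(row[col]) / 1000 for col, api in COLUMN_TO_API_FIELD.items()}


header = (
    "protein1 protein2 neighborhood fusion cooccurence coexpression "
    "experimental database textmining combined_score homology"
)
line = "9606.ENSP00000269305 9606.ENSP00000340989 0 0 0 0 981 750 859 999 0"

scores = row_to_api_scores(header, line)
print(scores["score"], scores["escore"], scores["dscore"], scores["tscore"])
```

The resulting values can then be compared field by field with the API record for the same protein pair.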

The size of the dataset grew from 11.7M to 13.7M.

DSuveges commented 1 year ago

Downstream updates:

PIS

A PR has been opened containing the following changes:

The STRING-derived interaction dataset is bumped to the newest release (v.12.0).
The output file is renamed to make it more intuitive. (Dependency: the ETL code also needs to be updated.)
Also sourced from the GS bucket. Files are dated.
string-interactions.json.gz is removed, as it is not picked up by the ETL.

ETL

A small commit reflecting the renamed output file was pushed directly to the ETL.

DSuveges commented 1 year ago

The STRING team explained the API version discrepancy:

Thanks a lot for the feedback. It's only the "version" API that returns the
same output as v11.5. This is because it is used by R-package, which has to be
checked for compatibility, before full release.

All the rest of the APIs return correct v12 results. I will change it in the
next few days, when I update the R-package.

ireneisdoomed commented 1 year ago

Important for @HelenaCornu @buniello to bear this in mind for comms.

HelenaCornu commented 1 year ago

Thanks @ireneisdoomed! Can I check whether I have understood:

There was an issue which meant that we have been showing an outdated version of the STRING data. However, with this update, we are now in line with the latest data, and the size of the dataset has increased by 2M.

ireneisdoomed commented 1 year ago

@HelenaCornu I would omit the issue and inform that we have integrated the latest data from STRING.

I don't see any release notes on their site, but they have recently published a manuscript based on the latest version. @DSuveges, is there anything specific we want to highlight? If not, I can help prepare some notes based on their publication, depending on how comprehensive we want to be about it.

prashantuniyal02 commented 1 year ago

Hi @LucaFumis, will you be able to update the STRING version from 11 to 12 in the Molecular Interactions widget on the platform?

LucaFumis commented 1 year ago

Hi @prashantuniyal02, the information in the four tab headers actually comes from our API through this query: https://github.com/opentargets/ot-ui-apps/blob/main/apps/platform/src/sections/target/MolecularInteractions/InteractionsStats.gql#L3

query InteractionsSectionQuery($ensgId: String!) {
  interactionResources {
    databaseVersion
    sourceDatabase
  }

  target(ensemblId: $ensgId) {
    id
    intact: interactions(sourceDatabase: "intact") {
      count
    }
    signor: interactions(sourceDatabase: "signor") {
      count
    }
    reactome: interactions(sourceDatabase: "reactome") {
      count
    }
    string: interactions(sourceDatabase: "string") {
      count
    }
  }
}

For TSLP, for example, the response is:

{
    "interactionResources": [
        {
            "databaseVersion": "11",
            "sourceDatabase": "string",
            "__typename": "InteractionResources"
        },
        {
            "databaseVersion": "243",
            "sourceDatabase": "intact",
            "__typename": "InteractionResources"
        },
        {
            "databaseVersion": "81",
            "sourceDatabase": "reactome",
            "__typename": "InteractionResources"
        },
        {
            "databaseVersion": "Not Available",
            "sourceDatabase": "signor",
            "__typename": "InteractionResources"
        }
    ],
    "target": {
        "id": "ENSG00000145777",
        "intact": {
            "count": 2,
            "__typename": "Interactions"
        },
        "signor": {
            "count": 1,
            "__typename": "Interactions"
        },
        "reactome": {
            "count": 3,
            "__typename": "Interactions"
        },
        "string": {
            "count": 615,
            "__typename": "Interactions"
        },
        "__typename": "Target"
    }
}

We would need to update that at the API level, unless we want to hardcode it in the front end (which I guess we don't).
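
As a rough sketch of how the reported version could be checked once the API is updated: the helper below reads `databaseVersion` per `sourceDatabase` from a response shaped like the one above. The function name `resource_versions` is hypothetical, and the sample payload is trimmed to the `interactionResources` part of that response:

```python
def resource_versions(response: dict) -> dict:
    """Map each sourceDatabase to the databaseVersion reported by the API."""
    return {
        r["sourceDatabase"]: r["databaseVersion"]
        for r in response["interactionResources"]
    }


# Trimmed copy of the response shown above.
response = {
    "interactionResources": [
        {"databaseVersion": "11", "sourceDatabase": "string"},
        {"databaseVersion": "243", "sourceDatabase": "intact"},
        {"databaseVersion": "81", "sourceDatabase": "reactome"},
        {"databaseVersion": "Not Available", "sourceDatabase": "signor"},
    ]
}

versions = resource_versions(response)
if versions["string"] != "12.0":
    print(f"STRING header still shows version {versions['string']}")
```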

DSuveges commented 1 year ago

@HelenaCornu, @ireneisdoomed I wouldn't go crazy with the comms for this update, given we are kind of underutilizing the network data (it's just an annotation at the moment). Moreover, we are quite underutilizing the STRING database itself, as we are only extracting binary interactions with the scores. We can highlight some of the key points from the publication: thanks to the improved analytical methods, not only did the number of binary interactions grow, but the scores also got better.

prashantuniyal02 commented 1 year ago

The STRING database citation has been updated in the documentation, and the BE team has updated the STRING table header to version 12.0.
