New VL breast dataset for integration into the PPP

buniello commented 2 months ago

The Validation Lab is ready to share breast cancer data with us for integration into PPP. Compared with the colorectal dataset currently in PPP, it adds a number of data points, such as three different contrast assays, and presents a challenge: how to assess a hit and how to visualise that assessment. Moreover, we may need to consider adding two new gold standard datasets to the widget: Project Score v2 and DepMap.

[x] The action item for the team is now to investigate the dataset and present a couple of options to the VL on how we could integrate/visualise all the data dimensions into the PPP.

@DSuveges please feel free to add to or edit this initial ticket.

DSuveges commented 2 months ago

Based on the discussion, I believe the ingestion bit should not be a problem. At this point the blocker is having a solidified idea on what the data should look like. More specifically:

What message the VL wants to deliver.
How this message can be generalised across all present and future VL deliveries and experiments, to ensure sustainability.

Once these questions are answered, developing the data model should not be too complicated. However, we need to keep in mind the load on backend and frontend, make sure that load is sustainable, and ensure that new VL data will not require further investment in the future.

buniello commented 2 months ago

For internal records, I am sharing here the link to a widget mock-up we discussed last week. The data team is now working on parsing the data into the agreed schema, and FE and I will pick up the task from there.

DSuveges commented 2 months ago

This is the new VL evidence schema:

{
    "targetFromSourceId": "ADSL",
    "diseaseFromSourceMappedId": "EFO_0000305",
    "diseaseFromSource": "breast carcinoma",
    "resourceScore": 1.0,
    "assays": [
        {
            "shortName": "CellTiter-Glo",
            "description": "Homogeneous method of determining the number of viable cells in culture based on quantitation of the ATP present, an indicator of metabolically active cells",
            "studyOverview": "Cas9-expressing cell lines were transfected with gene-specific gRNAs in 384-well arrayed format. After 7 days in culture, endpoint viability was measured by CellTitreGlo.",
            "isHit": true
        },
        {
            "shortName": "Cell Confluence",
            "description": "",
            "studyOverview": "",
            "isHit": true
        },
        {
            "shortName": "Cell Toxicity",
            "description": "",
            "studyOverview": "",
            "isHit": true
        }
    ],
    "assessment": "Multiple evidence of dependency. Evidence of toxicity",
    "diseaseCellLines": [
        {
            "name": "MCF7",
            "id": "SIDM00148",
            "tissue": "Breast",
            "tissueId": "UBERON_0000310"
        }
    ],
    "biomarkerList": [
        {
            "name": "MSS",
            "description": "Microsatellite stable"
        }
    ],
    "primaryProjectId": "OTAR015",
    "isPrimaryProjectHit": true,
    "projectId": "OTAR2059",
    "projectDescription": "CRISPR Cas9 Target ID",
    "datasourceId": "ot_crispr_validation",
    "datatypeId": "ot_validation_lab",
    "releaseDate": "2024-04-30",
    "releaseVersion": "OTAR3059-893478954"
}

buniello commented 2 months ago

@mbdebian - we now have a schema for the new VL lab dataset (see above), as mentioned yesterday. Tagging you and also @prashantuniyal02 to this so that the tasks can continue downstream (PIS+ETL steps).

DSuveges commented 2 months ago

@mbdebian, I'll soon provide a list of fields that will no longer be in the schema, so they can be removed from the evidence data model.

DSuveges commented 1 month ago

Hi @mbdebian I have a draft evidence set here:

gs://ot-team/dsuveges/vl.json.gz/part-00000-e2c940e4-6ecb-4386-bfb2-f4df1991443f-c000.json

This is based on mock data from the Validation Lab, but we are expecting to get a full dataset soon.

The following columns can be removed from the evidence data model:

validationHypotheses. With all its nested fields.
expectedConfidence
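
For anyone cleaning up existing records by hand, a minimal sketch of dropping these two fields could look like the following. The helper name `drop_deprecated` and the sample values are hypothetical and only illustrate the idea; this is not the actual ETL code.

```python
# Hypothetical helper: strip the fields that are leaving the VL evidence model.
DEPRECATED_FIELDS = ("validationHypotheses", "expectedConfidence")


def drop_deprecated(evidence: dict) -> dict:
    """Return a copy of an evidence record without the deprecated top-level fields."""
    return {key: value for key, value in evidence.items() if key not in DEPRECATED_FIELDS}


# Illustrative record; the placeholder values are made up for the example.
record = {
    "targetFromSourceId": "ADSL",
    "expectedConfidence": "placeholder",
    "validationHypotheses": [{"name": "placeholder", "description": "", "status": ""}],
}

cleaned = drop_deprecated(record)
print(sorted(cleaned))
```

Removing `validationHypotheses` at the top level also removes all of its nested fields, and the original record is left untouched.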

mbdebian commented 1 month ago

Thank you @DSuveges !

DSuveges commented 1 month ago

There has been a minor update. This is (hopefully) the final form of the schema:

{
  "datasourceId": "ot_crispr_validation",
  "datatypeId": "ot_validation_lab",
  "studyOverview": "Cas9-expressing cell lines were transfected with gene-specific gRNAs in 384-well arrayed format. Cell confluence and toxicity were monitored by live-cell imaging and endpoint viability (day 8-13 depending on cell line) was measured by CellTitre Glo",
  "projectId": "OTAR2059",
  "releaseDate": "2024-04-30",
  "releaseVersion": "OTAR2059-44690",
  "targetFromSourceId": "ZFP36L1",
  "diseaseFromSourceMappedId": "EFO_0000305",
  "diseaseFromSource": "breast carcinoma",
  "resourceScore": 1.0,
  "assessment": "No evidence of dependency No evidence of toxicity",
  "assays": [
    {
      "shortName": "CellTiter-Glo",
      "description": "Homogeneous method of quantifying the presence of ATP, an indicator of metabolically active cells, and using this as a proxy to determine the number of viable cells.",
      "isHit": true
    },
    {
      "shortName": "Toxicity",
      "description": "Determining the maximal percentage of cells displaying toxic signal by using live-cell imaging to quantify CellTox Green, which binds DNA of cells with impaired membrane integrity.",
      "isHit": true
    },
    {
      "shortName": "Confluence",
      "description": "Using live-cell kinetic imaging to quantify cell confluence as a proxy for number of viable cells over time, and calculation of area under the growth curve (AUC).",
      "isHit": true
    }
  ],
  "diseaseCellLines": [
    {
      "name": "MCF7",
      "id": "SIDM00148",
      "tissue": "breast",
      "tissueId": "UBERON_0000310"
    }
  ],
  "biomarkerList": [
    {
      "name": "Luminal B",
      "description": "PAM50 status: luminal A"
    },
    {
      "name": "Hormone dependent",
      "description": "Hormone dependency: hormone dependent"
    }
  ],
  "primaryProjectHit": true,
  "primaryProjectId": "OTAR015"
}

Data file is here: gs://ot-team/dsuveges/new_validation_lab.json.gz
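
To get a quick feel for the new `assays` array, a rough sketch like the one below could list which assays scored a hit for each target. It assumes the file is gzipped, newline-delimited JSON with one evidence record per line (an assumption, not confirmed here); `summarise_hits` is a hypothetical helper, and the demo record is built in memory with a made-up `isHit` value rather than read from the bucket.

```python
import gzip
import io
import json


def summarise_hits(lines):
    """Yield (target, [shortName of each assay flagged as a hit]) per evidence record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        evidence = json.loads(line)
        hits = [a["shortName"] for a in evidence.get("assays", []) if a.get("isHit")]
        yield evidence["targetFromSourceId"], hits


# Illustrative record only: field names follow the schema above, values are made up.
sample = {
    "targetFromSourceId": "ZFP36L1",
    "assays": [
        {"shortName": "CellTiter-Glo", "isHit": True},
        {"shortName": "Toxicity", "isHit": False},
    ],
}

# Round-trip through gzip in memory to mimic reading a local copy of the file.
payload = gzip.compress((json.dumps(sample) + "\n").encode("utf-8"))
with gzip.open(io.BytesIO(payload), "rt", encoding="utf-8") as handle:
    summary = list(summarise_hits(handle))

print(summary)
```

For a real run, the in-memory buffer would be replaced by the path of a local copy of the data file.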

buniello commented 1 month ago

@mbdebian let us know if it's all clear from your side!

mbdebian commented 1 month ago

> @mbdebian - we now have a schema for the new VL lab dataset (see above), as mentioned yesterday. Tagging you and also @prashantuniyal02 to this so that the tasks can continue downstream (PIS+ETL steps).

I just checked with PIS, and no update is required in there.

Next step is the ETL

mbdebian commented 1 month ago

> Hi @mbdebian I have a draft evidence set here:
> gs://ot-team/dsuveges/vl.json.gz/part-00000-e2c940e4-6ecb-4386-bfb2-f4df1991443f-c000.json
> This is based on a mock data from validation lab, but we are expecting to get a full dataset soon.
>
> The following columns can be removed from the evidence data model:
>
> validationHypotheses. With all its nested fields.
>
> expectedConfidence

Regarding the data model, I've checked that PIS does not perform any additional operations on the evidence beyond data collection. I expect the ETL and API components will require updates.

mbdebian commented 1 month ago

I've just compared the schemas:

Old Validation Lab data model

root
 |-- biomarkerList: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- confidence: string (nullable = true)
 |-- contrast: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseCellLines: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- tissue: string (nullable = true)
 |    |    |-- tissueId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- expectedConfidence: string (nullable = true)
 |-- projectDescription: string (nullable = true)
 |-- projectId: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- releaseVersion: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- statisticalTestTail: string (nullable = true)
 |-- studyOverview: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- validationHypotheses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- status: string (nullable = true)

New Validation Lab data model

root
 |-- assays: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- isHit: boolean (nullable = true)
 |    |    |-- shortName: string (nullable = true)
 |-- assessment: string (nullable = true)
 |-- biomarkerList: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseCellLines: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- tissue: string (nullable = true)
 |    |    |-- tissueId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- primaryProjectHit: boolean (nullable = true)
 |-- primaryProjectId: string (nullable = true)
 |-- projectId: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- releaseVersion: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- studyOverview: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)

Data Models Delta

New attributes

root
 |-- assays: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- isHit: boolean (nullable = true)
 |    |    |-- shortName: string (nullable = true)
 |-- assessment: string (nullable = true)
 |-- primaryProjectHit: boolean (nullable = true)
 |-- primaryProjectId: string (nullable = true)

Removed attributes

root
 |-- confidence: string (nullable = true)
 |-- contrast: string (nullable = true)
 |-- expectedConfidence: string (nullable = true)
 |-- projectDescription: string (nullable = true)
 |-- statisticalTestTail: string (nullable = true)
 |-- validationHypotheses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- status: string (nullable = true)
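
The top-level part of this delta can be reproduced with plain set arithmetic. The sketch below is only an illustration of the comparison, not the tooling actually used; the two field lists are copied from the schema dumps above, and nested fields are ignored.

```python
# Top-level field names copied from the old and new schema dumps above.
OLD_FIELDS = {
    "biomarkerList", "confidence", "contrast", "datasourceId", "datatypeId",
    "diseaseCellLines", "diseaseFromSource", "diseaseFromSourceMappedId",
    "expectedConfidence", "projectDescription", "projectId", "releaseDate",
    "releaseVersion", "resourceScore", "statisticalTestTail", "studyOverview",
    "targetFromSourceId", "validationHypotheses",
}
NEW_FIELDS = {
    "assays", "assessment", "biomarkerList", "datasourceId", "datatypeId",
    "diseaseCellLines", "diseaseFromSource", "diseaseFromSourceMappedId",
    "primaryProjectHit", "primaryProjectId", "projectId", "releaseDate",
    "releaseVersion", "resourceScore", "studyOverview", "targetFromSourceId",
}

added = NEW_FIELDS - OLD_FIELDS      # fields only in the new model
removed = OLD_FIELDS - NEW_FIELDS    # fields only in the old model

print("New attributes:    ", sorted(added))
print("Removed attributes:", sorted(removed))
```

This only compares field names; a change in a field's type or nesting would need a comparison of the full schema structure.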

mbdebian commented 1 month ago

> Hi @mbdebian I have a draft evidence set here:
> gs://ot-team/dsuveges/vl.json.gz/part-00000-e2c940e4-6ecb-4386-bfb2-f4df1991443f-c000.json
> This is based on a mock data from validation lab, but we are expecting to get a full dataset soon.
>
> The following columns can be removed from the evidence data model:
>
> validationHypotheses. With all its nested fields.
>
> expectedConfidence

These columns are not referenced in the ETL; the API will be checked for them next.

mbdebian commented 1 month ago

The ETL runs fine with the new validation lab sample dataset at gs://ot-team/dsuveges/new_validation_lab.json.gz

These checks have been completed in the ETL branch issue_3298, which will stay open in order to check the pipeline with the real dataset.

The next checkpoints in the data journey are OpenSearch and the Open Targets GraphQL API.

mbdebian commented 1 month ago

The API Evidence data model has been updated, and its 24.1.0-dev.4 tag release is deployed in our development environments for platform and PPP for testing. @opentargets/data-team @opentargets/fe-team, please provide feedback on the new 'evidence' data model that comes with this API. Thanks!

mbdebian commented 3 weeks ago

@prashantuniyal02, BE work on this has been completed, and no problems beyond the API have been reported. Would it be possible to flag this issue as completed?

prashantuniyal02 commented 4 days ago

Released in PPP
