Closed by buniello 4 months ago
Based on the discussion, I believe the ingestion bit should not be a problem. At this point the blocker is that we lack a settled idea of what the data should look like. More specifically:
Once these questions are answered, developing a data model should not be too complicated. However, we need to keep in mind the load on the backend and frontend, make sure this load is sustainable, and ensure that new VL data will not require further investment in the future.
For internal records, I am sharing here the link to a widget mock-up we discussed last week. The data team is now working on parsing the data into the agreed schema, and FE and I will pick up the task from there.
This is the new VL evidence schema:
{
"targetFromSourceId": "ADSL",
"diseaseFromSourceMappedId": "EFO_0000305",
"diseaseFromSource": "breast carcinoma",
"resourceScore": 1.0,
"assays": [
{
"shortName": "CellTiter-Glo",
"description": "Homogeneous method of determining the number of viable cells in culture based on quantitation of the ATP present, an indicator of metabolically active cells",
"studyOverview": "Cas9-expressing cell lines were transfected with gene-specific gRNAs in 384-well arrayed format. After 7 days in culture, endpoint viability was measured by CellTitreGlo.",
"isHit": true
},
{
"shortName": "Cell Confluence",
"description": "",
"studyOverview": "",
"isHit": true
},
{
"shortName": "Cell Toxicity",
"description": "",
"studyOverview": "",
"isHit": true
}
],
"assessment": "Multiple evidence of dependency. Evidence of toxicity",
"diseaseCellLines": [
{
"name": "MCF7",
"id": "SIDM00148",
"tissue": "Breast",
"tissueId": "UBERON_0000310"
}
],
"biomarkerList": [
{
"name": "MSS",
"description": "Microsatellite stable"
}
],
"primaryProjectId": "OTAR015",
"isPrimaryProjectHit": true,
"projectId": "OTAR2059",
"projectDescription": "CRISPR Cas9 Target ID",
"datasourceId": "ot_crispr_validation",
"datatypeId": "ot_validation_lab",
"releaseDate": "2024-04-30",
"releaseVersion": "OTAR3059-893478954"
}
@mbdebian - we now have a schema for the new VL dataset (see above), as mentioned yesterday. Tagging you and also @prashantuniyal02 so that the tasks can continue downstream (PIS+ETL steps).
@mbdebian, I will soon provide the list of fields that will no longer be in the schema, so they can be removed from the evidence data model.
Hi @mbdebian I have a draft evidence set here:
gs://ot-team/dsuveges/vl.json.gz/part-00000-e2c940e4-6ecb-4386-bfb2-f4df1991443f-c000.json
This is based on mock data from the Validation Lab, but we expect to receive the full dataset soon.
The following columns can be removed from the evidence data model:
- validationHypotheses. With all its nested fields.
- expectedConfidence
Thank you @DSuveges !
There have been some minor updates. This is (hopefully) the final form of the schema:
{
"datasourceId": "ot_crispr_validation",
"datatypeId": "ot_validation_lab",
"studyOverview": "Cas9-expressing cell lines were transfected with gene-specific gRNAs in 384-well arrayed format. Cell confluence and toxicity were monitored by live-cell imaging and endpoint viability (day 8-13 depending on cell line) was measured by CellTitre Glo",
"projectId": "OTAR2059",
"releaseDate": "2024-04-30",
"releaseVersion": "OTAR2059-44690",
"targetFromSourceId": "ZFP36L1",
"diseaseFromSourceMappedId": "EFO_0000305",
"diseaseFromSource": "breast carcinoma",
"resourceScore": 1.0,
"assessment": "No evidence of dependency No evidence of toxicity",
"assays": [
{
"shortName": "CellTiter-Glo",
"description": "Homogeneous method of quantifying the presence of ATP, an indicator of metabolically active cells, and using this as a proxy to determine the number of viable cells.",
"isHit": true
},
{
"shortName": "Toxicity",
"description": "Determining the maximal percentage of cells displaying toxic signal by using live-cell imaging to quantify CellTox Green, which binds DNA of cells with impaired membrane integrity.",
"isHit": true
},
{
"shortName": "Confluence",
"description": "Using live-cell kinetic imaging to quantify cell confluence as a proxy for number of viable cells over time, and calculation of area under the growth curve (AUC).",
"isHit": true
}
],
"diseaseCellLines": [
{
"name": "MCF7",
"id": "SIDM00148",
"tissue": "breast",
"tissueId": "UBERON_0000310"
}
],
"biomarkerList": [
{
"name": "Luminal B",
"description": "PAM50 status: luminal A"
},
{
"name": "Hormone dependent",
"description": "Hormone dependency: hormone dependent"
}
],
"primaryProjectHit": true,
"primaryProjectId": "OTAR015"
}
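For anyone who wants to sanity-check records against this schema before they reach the ETL, here is a minimal sketch in plain Python. The field names and types come from the schema above; the function name `validate_evidence` is hypothetical, and treating every field as required is an assumption made for illustration (the pipeline itself may allow nulls).

```python
from typing import Any

# Top-level fields and their expected Python types, copied from the schema above.
# Assumption: all fields are treated as required for this check.
EXPECTED_FIELDS = {
    "datasourceId": str,
    "datatypeId": str,
    "studyOverview": str,
    "projectId": str,
    "releaseDate": str,
    "releaseVersion": str,
    "targetFromSourceId": str,
    "diseaseFromSourceMappedId": str,
    "diseaseFromSource": str,
    "resourceScore": (int, float),
    "assessment": str,
    "assays": list,
    "diseaseCellLines": list,
    "biomarkerList": list,
    "primaryProjectHit": bool,
    "primaryProjectId": str,
}

# Fields expected inside each element of "assays".
ASSAY_FIELDS = {"shortName": str, "description": str, "isHit": bool}


def validate_evidence(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")

    assays = record.get("assays")
    if isinstance(assays, list):
        for i, assay in enumerate(assays):
            for name, expected in ASSAY_FIELDS.items():
                if not isinstance(assay, dict) or not isinstance(assay.get(name), expected):
                    problems.append(f"assays[{i}].{name} missing or wrong type")

    # Anything outside the agreed schema (e.g. a leftover old field) is reported too.
    for name in sorted(set(record) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems
```

Calling `validate_evidence(record)` on a parsed evidence object returns the problems found, so it can be used as a quick filter over a sample file.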
Data file is here: gs://ot-team/dsuveges/new_validation_lab.json.gz
@mbdebian let us know if it's all clear from your side!
I just checked PIS, and no update is required there.
The next step is the ETL.
Hi @mbdebian I have a draft evidence set here:
gs://ot-team/dsuveges/vl.json.gz/part-00000-e2c940e4-6ecb-4386-bfb2-f4df1991443f-c000.json
This is based on mock data from the Validation Lab, but we expect to receive the full dataset soon.
The following columns can be removed from the evidence data model:
- validationHypotheses. With all its nested fields.
- expectedConfidence
Regarding the data model, I've checked that PIS does not perform any additional operations on the evidence beyond data collection. I expect that the ETL and API components will require updates.
I've just compared the schemas. The old schema is listed first, followed by the new one:
root
|-- biomarkerList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- name: string (nullable = true)
|-- confidence: string (nullable = true)
|-- contrast: string (nullable = true)
|-- datasourceId: string (nullable = true)
|-- datatypeId: string (nullable = true)
|-- diseaseCellLines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- tissue: string (nullable = true)
| | |-- tissueId: string (nullable = true)
|-- diseaseFromSource: string (nullable = true)
|-- diseaseFromSourceMappedId: string (nullable = true)
|-- expectedConfidence: string (nullable = true)
|-- projectDescription: string (nullable = true)
|-- projectId: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- releaseVersion: string (nullable = true)
|-- resourceScore: double (nullable = true)
|-- statisticalTestTail: string (nullable = true)
|-- studyOverview: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
|-- validationHypotheses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- status: string (nullable = true)
root
|-- assays: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- isHit: boolean (nullable = true)
| | |-- shortName: string (nullable = true)
|-- assessment: string (nullable = true)
|-- biomarkerList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- name: string (nullable = true)
|-- datasourceId: string (nullable = true)
|-- datatypeId: string (nullable = true)
|-- diseaseCellLines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- tissue: string (nullable = true)
| | |-- tissueId: string (nullable = true)
|-- diseaseFromSource: string (nullable = true)
|-- diseaseFromSourceMappedId: string (nullable = true)
|-- primaryProjectHit: boolean (nullable = true)
|-- primaryProjectId: string (nullable = true)
|-- projectId: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- releaseVersion: string (nullable = true)
|-- resourceScore: double (nullable = true)
|-- studyOverview: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
New attributes
root
|-- assays: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- isHit: boolean (nullable = true)
| | |-- shortName: string (nullable = true)
|-- assessment: string (nullable = true)
|-- primaryProjectHit: boolean (nullable = true)
|-- primaryProjectId: string (nullable = true)
Removed attributes
root
|-- confidence: string (nullable = true)
|-- contrast: string (nullable = true)
|-- expectedConfidence: string (nullable = true)
|-- projectDescription: string (nullable = true)
|-- statisticalTestTail: string (nullable = true)
|-- validationHypotheses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- status: string (nullable = true)
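The comparison above can be reproduced with a simple set difference over the top-level field names. This is a minimal sketch rather than the code actually used; the field names are copied from the two schemas above, and the helper name `diff_fields` is hypothetical.

```python
# Top-level field names of the old evidence schema (first tree above).
OLD_FIELDS = {
    "biomarkerList", "confidence", "contrast", "datasourceId", "datatypeId",
    "diseaseCellLines", "diseaseFromSource", "diseaseFromSourceMappedId",
    "expectedConfidence", "projectDescription", "projectId", "releaseDate",
    "releaseVersion", "resourceScore", "statisticalTestTail", "studyOverview",
    "targetFromSourceId", "validationHypotheses",
}

# Top-level field names of the new evidence schema (second tree above).
NEW_FIELDS = {
    "assays", "assessment", "biomarkerList", "datasourceId", "datatypeId",
    "diseaseCellLines", "diseaseFromSource", "diseaseFromSourceMappedId",
    "primaryProjectHit", "primaryProjectId", "projectId", "releaseDate",
    "releaseVersion", "resourceScore", "studyOverview", "targetFromSourceId",
}


def diff_fields(old: set[str], new: set[str]) -> tuple[list[str], list[str]]:
    """Return (added, removed) top-level field names, each sorted alphabetically."""
    return sorted(new - old), sorted(old - new)


added, removed = diff_fields(OLD_FIELDS, NEW_FIELDS)
print("New attributes:", added)
print("Removed attributes:", removed)
```

Note that this only compares top-level names; a change inside a nested struct (for example in `diseaseCellLines`) would need a recursive comparison.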
These columns (validationHypotheses and expectedConfidence) are not referenced in the ETL; the API will be searched for them.
The ETL runs fine with the new Validation Lab sample dataset at gs://ot-team/dsuveges/new_validation_lab.json.gz
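As an extra check on a dataset like this one, the sketch below scans a local copy and reports any of the removed attributes that still appear in a record. It assumes the file has been downloaded from the bucket first and that it is gzipped JSON lines (one evidence object per line), which is how Spark usually writes such output but is not confirmed here. The function name `find_stale_fields` is hypothetical.

```python
import gzip
import json

# Attributes dropped from the evidence data model, per the schema comparison in this thread.
REMOVED_FIELDS = {
    "confidence", "contrast", "expectedConfidence",
    "projectDescription", "statisticalTestTail", "validationHypotheses",
}


def find_stale_fields(path: str) -> set[str]:
    """Return the removed attributes that still occur in a gzipped JSON-lines file.

    Assumption: `path` points to a local copy with one JSON object per line.
    """
    stale: set[str] = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Intersect the removed names with this record's top-level keys.
            stale |= REMOVED_FIELDS.intersection(record)
    return stale
```

An empty set from `find_stale_fields("new_validation_lab.json.gz")` would mean none of the removed columns are present in the sample.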
These checks have been completed in the ETL branch issue_3298, which will stay open in order to check the pipeline with the real dataset.
The next checkpoints in the data journey are OpenSearch and the Open Targets GraphQL API.
The API evidence data model has been updated, and its 24.1.0-dev.4 tag release is deployed in our development environments for platform and PPP for testing. @opentargets/data-team @opentargets/fe-team, please provide feedback on the new 'evidence' data model that comes with this API. Thanks!
@prashantuniyal02, BE work on this has been completed, and no problems beyond the API have been reported. Would it be possible to flag this issue as completed?
Released in PPP
The Validation Lab is ready to share Breast Cancer data with us for integration into PPP. The dataset adds a number of data points to the colorectal one (currently in PPP), such as three different contrast assays, and presents a challenge in how to assess a hit and how to visualise this assessment. Moreover, we may need to consider adding two new gold standard datasets to the widget: Project Score v2 and DepMap.
@DSuveges please feel free to add to or edit this initial ticket.