Open buniello opened 1 year ago
@buniello Thank you for a really productive meeting the other day! This is the repository which contains the current implementation and all accompanying information: https://github.com/opentargets/baseline-expression
I'm currently in the process of re-running this all in order to generate a proper JSON schema. I'm going to put it under https://github.com/opentargets/json_schema/tree/master/schemas, is that OK?
Thank you @tskir!
Hi @buniello, I uploaded the baseline expression data to: gs://otar000-evidence_input/BaselineExpression/json/baseline-expression-2023-07-31.json.gz
The annotated JSON schema for the data is available in this PR: https://github.com/opentargets/json_schema/pull/172. It is not yet merged, but there won't be any breaking changes, so you could start using it in the meantime.
As far as I know, there are no more actions on my side to deliver this. Please let me know if I can help further with any aspects of this task :upside_down_face:
Hi @mbdebian and @carcruz! While Annalisa is on annual leave, I'll take over with scoping and planning for this issue. I was given to understand that we really want this functionality in the 23.09 release, so if this is to happen, we should start implementation soon.
In short, we want to incorporate this data
gs://otar000-evidence_input/BaselineExpression/json/baseline-expression-2023-07-31.json.gz
which follows this schema
https://github.com/opentargets/json_schema/blob/master/schemas/baseline_expression.json
into a new "Specificity" tab in the "Baseline expression" widget.
I've created a sketch for what this UI should look like (you can also look at it live on the whiteboard in Studio 1):
The data is packaged in separate per-gene objects, so a single query could return everything which is required to populate this view. The only exception is the "all genes" distribution in the "Overall specificity" section, which must be computed over, well, all genes. But this is a nice-to-have rather than a must-have feature, so we can skip it for now and figure out the details later.
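To illustrate the per-gene packaging, here is a minimal sketch of indexing the file by gene so that one lookup returns everything for a gene. It assumes the file is gzip-compressed, newline-delimited JSON with one object per gene; the `gene_id` field name and the record shape are hypothetical, so check the schema for the real ones.

```python
import gzip
import io
import json
from typing import IO, Dict


def index_by_gene(stream: IO[bytes], id_field: str = "gene_id") -> Dict[str, dict]:
    """Read gzip-compressed JSON lines and index each object by its gene ID.

    `id_field` is a hypothetical field name, not taken from the real schema.
    """
    index: Dict[str, dict] = {}
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                index[record[id_field]] = record
    return index


# Tiny in-memory stand-in for the real file, using a made-up record shape.
sample = io.BytesIO(
    gzip.compress(
        b'{"gene_id": "G1", "tissues": []}\n{"gene_id": "G2", "tissues": []}\n'
    )
)
by_gene = index_by_gene(sample)
```

With such an index, `by_gene["G1"]` would hold the whole object needed for that gene's view.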
I'm happy to have a catch up to discuss this whenever convenient for you, as well as answer any questions here or in Slack :upside_down_face:
Hi @chinmehta - see Kirill's comments above to access data and schema for this feature. Let us know what you think!
As discussed on 23-08-2023, we will start BE implementation for this feature so that FE can scope the downstream work. The final product will be released as soon as the Data team has bandwidth to update the GTEx dataset.
Discussed at the leads meeting today: the final implementation of the updated widget is scoped for December to update GTEx data.
@buniello @opentargets/be-team @opentargets/fe-team
I've implemented direct ingestion of GTEx V8 data, which is the latest release. The code isn't merged yet, as it is awaiting review, but here's the link to the new data for testing purposes: gs://otar000-evidence_input/BaselineExpression/json/baseline_expression-2023-10-03.json.gz
Compared to the data I released 2023-07-31, there is one tiny schema change: instead of FPKM, the expression values are now reported in TPM. This only involves renaming the field fpkm to tpm and nothing else. Here is the link to the JSON schema PR with this change.
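For anyone with test fixtures written against the 2023-07-31 schema, the rename could be applied with a small helper like the sketch below. The record shape here is made up for illustration. The helper only renames the key: it does not convert FPKM values into TPM, so it is not a substitute for the new data.

```python
from typing import Any


def rename_key(obj: Any, old: str = "fpkm", new: str = "tpm") -> Any:
    """Return a copy of obj with every dict key `old` renamed to `new`."""
    if isinstance(obj, dict):
        return {
            (new if key == old else key): rename_key(value, old, new)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [rename_key(item, old, new) for item in obj]
    return obj  # scalars are returned unchanged


# Hypothetical record shape, for illustration only.
old_record = {"gene_id": "G1", "tissues": [{"name": "liver", "fpkm": 1.5}]}
new_record = rename_key(old_record)
```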
Please let me know if I can help with anything at all for bringing this feature into production. Always happy to have a chat or address any questions that anyone might have.
After talking to @d0choa and @tskir, the BE tasks are the following:
The new ETL step is going to validate the baseline_expression file produced by the data team against the output from the target step. @tskir @d0choa @DSuveges Should we output the invalid genes (the ones that didn't have a match in the output from Target) to another output?
Should we output the invalid genes (the ones that didn't have a match in the output from Target) to another output?
Since, to my understanding, we're doing this for other sources (outputting invalid evidence separately), I think it makes sense to do it similarly in this case, too.
We only do it for evidence. It's okay not to produce an invalid baseline expression dataset, but make sure you leave some log to keep this number under control.
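The approach agreed above (drop genes with no match in the target output, and log how many were dropped rather than writing them out) could look roughly like this. It is a plain-Python sketch and not the actual ETL code; the `gene_id` field name, the logger name and the function name are assumptions made for illustration.

```python
import logging
from typing import Iterable, List, Set, Tuple

logger = logging.getLogger("baseline_expression_validation")  # hypothetical name


def validate_against_targets(
    records: Iterable[dict],
    target_ids: Set[str],
    id_field: str = "gene_id",  # hypothetical field name
) -> Tuple[List[dict], int]:
    """Keep records whose gene ID is in target_ids; count and log the rest."""
    valid: List[dict] = []
    invalid_count = 0
    for record in records:
        if record.get(id_field) in target_ids:
            valid.append(record)
        else:
            invalid_count += 1
    if invalid_count:
        # No separate "invalid" dataset is written; only the count is logged.
        logger.warning(
            "Dropped %d baseline expression records with no matching target",
            invalid_count,
        )
    return valid, invalid_count


valid, dropped = validate_against_targets(
    [{"gene_id": "G1"}, {"gene_id": "G2"}, {}],
    target_ids={"G1"},
)
```

Returning the count alongside the valid records makes it easy to track the number across releases.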
Hi @opentargets/be-team and @opentargets/fe-team,
There's an important conceptual change that we discussed yesterday with @buniello and would like to see applied to the implementation of this task. It doesn't change anything about the data or its visualisation, and just involves high-level changes; essentially where to put it. We hope that at this point of development it should be very easy to accommodate these changes, but please let us know if there are any problems.
Rather than expanding the existing "Baseline Expression" widget with a new "Specificity" tab, as shown on the sketch attached to this issue:
BASELINE EXPRESSION
---------------------------------------------------------------
*Specificity* | Summary | Experiments (E.A.) | Variation (GTEx)
We'd like to leave that widget unchanged, and instead add a new widget called "Expression Specificity":
BASELINE EXPRESSION
-----------------------------------------------
Summary | Experiments (E.A.) | Variation (GTEx)
*EXPRESSION SPECIFICITY*
------------------------
*Bulk (GTEx V8)*
The API and data should also be exposed as the expressionSpecificity endpoint/download files. The existing baselineExpression endpoints and files should not be changed.
The input files and the schema remain unchanged for now:
(The names are unchanged, because in the future, as this dataset grows and starts to accommodate more tissues and their hierarchy, it will start feeding into the "Baseline expression" widget as well — however, for now it should only feed into "Expression Specificity".)
As @lucy-adelaide is now finalising the single cell expression data, it's clear that it will be ready for production reasonably soon (not for this release, but for the March 2024 one). There are also plans to incorporate a proteomics dataset. This will necessitate adding further tabs, so the Expression Specificity widget will look something like this:
EXPRESSION SPECIFICITY
-----------------------------------------------------------------------
Bulk (GTEx V8) | *Single cell (Tabula Sapiens)* | *Protein (Wang 2019)*
Adding those tabs to the "Baseline Expression" widget would definitely be overkill, especially seeing as expression specificity is a different modality from just displaying the expression values.
I'll update the output of the ETL to reflect the new name.
@tskir could you please confirm here that the data looks good and give the green light for FE scoping with the next release?
@buniello @opentargets/be-team @opentargets/fe-team I can confirm that, based on querying several genes, the API works well and returns the correct data from the 3rd of October message above. You have my green light :-)
Great, thank you Kirill
@buniello, are we scoping this for the next release? Could we close this issue for the backend team, as I think the work has been completed on our side? Thanks!
We need to scope tasks and capacities to integrate the baseline expression data into the Platform. @tskir and I will initiate discussions this week and will populate/assign this ticket accordingly.