Scoping integration of baseline expression data into the Platform

buniello commented 1 year ago

We need to scope tasks and capacities to integrate the baseline expression data into the Platform. @tskir and I will initiate discussions this week and will populate/assign this ticket accordingly.

[x] https://github.com/opentargets/issues/issues/3096 @tskir — everything completed

tskir commented 1 year ago

@buniello Thank you for a really productive meeting the other day! This is the repository which contains the current implementation and all accompanying information: https://github.com/opentargets/baseline-expression

I'm currently re-running all of it in order to generate a proper JSON schema. I'm going to put it under https://github.com/opentargets/json_schema/tree/master/schemas, is that OK?

buniello commented 1 year ago

Thank you @tskir!

tskir commented 1 year ago

Hi @buniello, I uploaded the baseline expression data to: gs://otar000-evidence_input/BaselineExpression/json/baseline-expression-2023-07-31.json.gz

The annotated JSON schema for the data is available in this PR: https://github.com/opentargets/json_schema/pull/172. It is not yet merged, but there won't be any breaking changes, so you can start using it in the meantime.

As far as I know, there are no more actions on my side to deliver this. Please let me know if I can help further with any aspects of this task :upside_down_face:

tskir commented 1 year ago

Hi @mbdebian and @carcruz! While Annalisa is on annual leave, I'll take over with scoping and planning for this issue. I was given to understand that we really want this functionality in the 23.09 release, so if this is to happen, we should start implementation soon.

In short, we want to incorporate this data

gs://otar000-evidence_input/BaselineExpression/json/baseline-expression-2023-07-31.json.gz

which follows this schema

https://github.com/opentargets/json_schema/blob/master/schemas/baseline_expression.json

into a new "Specificity" tab in the "Baseline expression" widget.

I've created a sketch of what this UI should look like (you can also look at it live on the whiteboard in Studio 1):

The data is packaged in separate per-gene objects, so a single query can return everything required to populate this view. The only exception is the "all genes" distribution in the "Overall specificity" section, which must be computed over, well, all genes. But this is a nice-to-have rather than a must-have feature, so we can skip it for now and figure out the details later.
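To illustrate the per-gene packaging, here is a minimal Python sketch. It assumes the file holds one JSON object per line and that each object carries its gene identifier in a field called `ensemblGeneId`. Both are assumptions made for illustration; the schema linked above is the authoritative description.

```python
import gzip
import json


def index_by_gene(path, gene_field="ensemblGeneId"):
    """Read a gzipped JSON-lines file and index its records by gene ID.

    Assumptions (not confirmed by the schema): one JSON object per line,
    and a gene identifier stored under `gene_field` (hypothetical name).
    """
    records = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record[gene_field]] = record
    return records
```

With an index like this, everything needed for one gene's view is a single lookup, e.g. `index_by_gene(path)[gene_id]`.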

I'm happy to have a catch up to discuss this whenever convenient for you, as well as answer any questions here or in Slack :upside_down_face:

buniello commented 1 year ago

Hi @chinmehta - see Kirill's comments above to access the data and schema for this feature. Let us know what you think!

As discussed on 23-08-2023, we will start the BE implementation for this feature so that FE can scope the downstream work. The final product will be released as soon as the Data team has the bandwidth to update the GTEx dataset.

ireneisdoomed commented 1 year ago

Discussed at the leads meeting today: the final implementation of the updated widget is scoped for December to update GTEx data.

tskir commented 1 year ago

@buniello @opentargets/be-team @opentargets/fe-team

I've implemented direct ingestion of GTEx V8 data, which is the latest release. The code isn't yet merged awaiting review, but here's the link to the new data for testing purposes: gs://otar000-evidence_input/BaselineExpression/json/baseline_expression-2023-10-03.json.gz.

Compared to the data I released on 2023-07-31, there is one tiny schema change: instead of FPKM, the expression values are now reported in TPM. This only involves renaming the field fpkm to tpm and nothing else. Here is the link to the JSON schema PR with this change.
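For anyone with code or test fixtures written against the 2023-07-31 file, a small helper along these lines could adapt the structure. This is only a sketch: the nesting of the records is not assumed, so it renames the key wherever it appears. It changes field names only and does not convert FPKM values into TPM.

```python
def rename_key(obj, old="fpkm", new="tpm"):
    """Return a copy of a JSON-like structure with `old` keys renamed to `new`.

    Walks dicts and lists recursively; the input is left unmodified.
    Values are copied as they are: no unit conversion is performed.
    """
    if isinstance(obj, dict):
        return {
            (new if key == old else key): rename_key(value, old, new)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [rename_key(item, old, new) for item in obj]
    return obj
```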

Please let me know if I can help with anything at all for bringing this feature into production. Always happy to have a chat or address any questions that anyone might have.

remo87 commented 1 year ago

After talking to @d0choa and @tskir, the BE tasks are the following:

[x] Add the baselineExpression to the ETL
[x] Update POS to include the new data
[ ] Update the api to expose the baselineExpression information

remo87 commented 1 year ago

The new ETL step is going to validate the baseline_expression file produced by the data team against the output from the target step. @tskir @d0choa @DSuveges Should we output the invalid genes (the ones that didn't have a match in the output from Target) to another output?

tskir commented 1 year ago

> Should we output the invalid genes (the ones that didn't have a match in the output from Target) to another output?

Since, to my understanding, we're already doing this for other sources (outputting invalid evidence separately), I think it makes sense to do the same in this case, too.

d0choa commented 1 year ago

We only do it for evidence. It's okay not to produce an invalid baseline expression dataset, but make sure you log the number of unmatched genes so that we can keep it under control.
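As a rough illustration of that approach (plain Python, not the ETL's actual code), the step could keep only records whose gene is present in the target output and log how many were dropped. The field name `ensemblGeneId` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("expression_etl")  # hypothetical logger name


def keep_known_targets(records, target_ids, gene_field="ensemblGeneId"):
    """Keep records whose gene ID appears in `target_ids`; log the rest.

    Returns the valid records and the number dropped. Records that lack
    `gene_field` altogether are counted as dropped as well.
    """
    records = list(records)
    target_ids = set(target_ids)
    valid = [r for r in records if r.get(gene_field) in target_ids]
    dropped = len(records) - len(valid)
    if dropped:
        logger.warning(
            "Dropped %d of %d expression records with no matching target",
            dropped,
            len(records),
        )
    return valid, dropped
```

Returning the count alongside the data makes it easy to track the number across releases without writing a separate invalid dataset.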

tskir commented 1 year ago

Hi @opentargets/be-team and @opentargets/fe-team,

There's an important conceptual change that we discussed yesterday with @buniello and would like to see applied to the implementation of this task. It doesn't change anything about the data or its visualisation, and only involves high-level changes; essentially, where to put it. We hope that at this point in development it should be very easy to accommodate these changes, but please let us know if there are any problems.

The change

Rather than expanding the existing "Baseline Expression" widget with a new "Specificity" tab, as shown in the sketch attached to this issue:

BASELINE EXPRESSION
---------------------------------------------------------------
*Specificity* | Summary | Experiments (E.A.) | Variation (GTEx)

We'd like to leave that widget unchanged, and instead add a new widget called "Expression Specificity":

BASELINE EXPRESSION
-----------------------------------------------
Summary | Experiments (E.A.) | Variation (GTEx)

*EXPRESSION SPECIFICITY*
------------------------
*Bulk (GTEx V8)*

Endpoints, inputs, and downloads

The API and data should also be exposed as the expressionSpecificity endpoint/download files. The existing baselineExpression endpoints and files should not be changed.

The input files and the schema remain unchanged for now:

Code: https://github.com/opentargets/evidence_datasource_parsers/tree/master/modules/baseline_expression
Data: gs://otar000-evidence_input/BaselineExpression/json/baseline_expression-2023-10-03.json.gz
Schema: https://github.com/opentargets/json_schema/blob/master/schemas/baseline_expression.json

(The names are unchanged, because in the future, as this dataset grows and starts to accommodate more tissues and their hierarchy, it will start feeding into the "Baseline expression" widget as well — however, for now it should only feed into "Expression Specificity".)

Rationale and future plans

As @lucy-adelaide is now finalising the single cell expression data, it's clear that it will be ready for production reasonably soon (not for this release, but for the March 2024 one). There are also plans to incorporate a proteomics dataset. This will necessitate adding further tabs, so the Expression Specificity widget will look something like this:

EXPRESSION SPECIFICITY
-----------------------------------------------------------------------
Bulk (GTEx V8) | *Single cell (Tabula Sapiens)* | *Protein (Wang 2019)*

Adding those tabs to the "Baseline Expression" widget would definitely be overkill, especially seeing as expression specificity is a different modality from just displaying the expression values.

remo87 commented 1 year ago

I'll update the output of the ETL to reflect the new name.

buniello commented 11 months ago

@tskir could you please confirm here that the data looks good and give the green light for FE scoping with the next release?

tskir commented 11 months ago

@buniello @opentargets/be-team @opentargets/fe-team I can confirm that, based on querying several genes, the API works well and returns the correct data from the 3rd of October message above. You have my green light :-)

buniello commented 11 months ago

Great, thank you Kirill

buniello commented 9 months ago

See here for the latest version of the dataset schema

mbdebian commented 7 months ago

@buniello , are we scoping this for the next release? Could we close this issue for the backend team, as I think the work has been completed on our side? Thanks!
