Mouse Phenotypes Column for Target Prioritisation

opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal

https://platform.opentargets.org https://genetics.opentargets.org

Apache License 2.0


Mouse Phenotypes Column for Target Prioritisation #3118

Closed Juanmaria-rr closed 7 months ago

Juanmaria-rr commented 9 months ago

Background

We need to add a new column to the Target Prioritisation view that informs on safety using mouse knockout (KO) models. To build it, we took the mouse phenotypes reported for each KO model and classified them by severity using the high-level Phenotype Classes. The scores are then aggregated per target with a harmonic sum to give a mouse phenotype score.
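As a rough illustration of the per-target aggregation described above, here is a minimal pure-Python sketch. It assumes the harmonic sum takes the form used elsewhere in Open Targets scoring (scores sorted in descending order, the i-th score divided by i squared); the target engine code may normalise or weight the result differently. The target IDs and severity values are made up for the example.

```python
from collections import defaultdict


def harmonic_sum(scores):
    """Sum of the i-th largest score divided by i squared (i starts at 1)."""
    ordered = sorted(scores, reverse=True)
    return sum(s / (i ** 2) for i, s in enumerate(ordered, start=1))


def aggregate_per_target(rows):
    """Group (target_id, severity_score) pairs and harmonic-sum each group."""
    grouped = defaultdict(list)
    for target_id, score in rows:
        grouped[target_id].append(score)
    return {target: harmonic_sum(values) for target, values in grouped.items()}


# Hypothetical target IDs and severity scores, for illustration only.
rows = [("TARGET_A", 1.0), ("TARGET_A", 0.5), ("TARGET_B", 0.25)]
result = aggregate_per_target(rows)
print(result)
```

With this weighting, the most severe phenotype dominates the score and each additional phenotype contributes progressively less.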

This dataset covers more than 12,000 targets and shows some predictive ability for human safety liabilities.

Code and data availability:

The code for the new column is available in the Target Engine repo at src/data_flow/target_properties_wb.py, and the high-level scores are in /src/data_flow/phenotypeScores/20230825_mousePheScores.csv.

Tasks

[x] Transform PySpark code to Scala and use the Harmonic Sum for aggregating all scores per target
[x] Codify scores from 0 to -1; null for targets without mouse KO data.
[x] Compare results
[x] Update PIS to have mouse phenotype scores

jdhayhurst commented 8 months ago

Inputs:

1. mousePhenotypes
2. new input (requires PIS change): mouse phenotype scores

Transformations:

As in https://github.com/opentargets/target_engine/blob/main/src/data_flow/target_properties_wb.py. Input (1) is the output of the current mouse phenotypes ETL step. The transformations in the Python file are aimed at producing a different "score" from the one currently in place.

Output: The same as the current output, with the "score" field changed. No changes to the API should be required.

jdhayhurst commented 8 months ago

@Juanmaria-rr I've made the ETL changes, run them locally and uploaded the parquet to this bucket for verification: gs://open-targets-pre-data-releases/jhpis/output/etl/parquet/targetPrioritisation. I renamed the field from hasMouseKO to MouseKOScore, which will require a small FE change to pick it up. I kept all your logic with one change: I added a descending sort of the scores before running the harmonic sum. That may not have been necessary for you, either because of the way PySpark/Python handles the effect scores or because of the way the data are organised. Based on this, I'm fairly sure this sort is required, but wanted to check it with you in case I'm introducing something wrong. Cheers!
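A small sketch of why the sort can matter, assuming the harmonic sum weights the i-th score by 1/i^2 (an assumption about the weighting, not taken from the ETL code). Because the weights depend on position, the same scores give different totals in different orders, and descending order gives the largest total. The values below are invented.

```python
def weighted_sum_in_given_order(scores):
    """Apply 1/i^2 weights to the scores in the order they arrive (no sorting)."""
    return sum(s / (i ** 2) for i, s in enumerate(scores, start=1))


scores = [0.2, 1.0, 0.6]  # made-up values in arbitrary order

unsorted_total = weighted_sum_in_given_order(scores)
sorted_total = weighted_sum_in_given_order(sorted(scores, reverse=True))

print(unsorted_total, sorted_total)
```

If the input rows already happen to arrive in descending order, the explicit sort changes nothing, which would explain identical results with and without it.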

jdhayhurst commented 8 months ago

leaving ticket open until data has been reviewed

buniello commented 8 months ago

Another relevant task for this issue (@carcruz) is:

[ ] move the Mouse KO column from doability to safety in target prioritisation view before public release
[ ] Mouse KO column will be renamed to Mouse models

@Juanmaria-rr please let me know if there is any change we should make in the column documentation (e.g. new scores)

Juanmaria-rr commented 8 months ago

Hi. I compared @jdhayhurst's results with mine and we got the same numbers, except for "0". @jdhayhurst, I think there is an error in the targets with a score of 0, because they appear as "-0". Could you please fix it?

jdhayhurst commented 8 months ago

Thanks @Juanmaria-rr, well spotted. I will fix this in the ETL.
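One plausible cause of the "-0" values (an assumption, since the ETL code is not shown in this thread) is that negating a floating-point 0.0 produces IEEE 754 negative zero. The sketch below reproduces that in Python and shows one way to normalise it; flip_sign is a hypothetical helper, not the function used in the ETL.

```python
def flip_sign(score):
    """Negate a score, keeping None as None and avoiding negative zero."""
    if score is None:
        return None  # targets without mouse KO data stay null
    flipped = -score
    # Adding positive zero turns -0.0 into 0.0 and leaves other values unchanged.
    return flipped + 0.0


print(-0.0)            # plain negation of zero keeps the minus sign
print(flip_sign(0.0))  # normalised zero
print(flip_sign(0.5))
```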

buniello commented 8 months ago

FE testing of new column is in progress

Juanmaria-rr commented 8 months ago

Below you can find the score distribution (shown as positive values, before the flip to negative):

Due to the distribution of the mouse phenotype scores, a substantial number of the values fall in the range shown as deep red.

We would like to transform the values so that scores in the lowest 25% (that is, below 0.60) appear as 0, while the rest are linearly rescaled to the range 0 to -1.

There is already code implemented in the BE for some columns that could be reused. For instance, the following code could be used (taken from the mouse ortholog column):

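# Assumes `from pyspark.sql import functions as F`.
.withColumn(
    "mouseScores",
    # Scores below the 0.60 threshold are set to 0.
    F.when(F.col("mousePhenotypeScore") < 0.60, F.lit(0))
    # Scores at or above 0.60 are rescaled linearly to the 0-1 range.
    .when(
        F.col("mousePhenotypeScore") >= 0.60,
        (F.col("mousePhenotypeScore") - 0.60) / 0.40,
    )
)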
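For checking individual values by hand, the same rescaling can be sketched in plain Python. This mirrors the snippet above on positive scores. The sign flip to the 0 to -1 range is assumed to happen in a separate step, and rescale is a hypothetical helper name.

```python
def rescale(score, threshold=0.60):
    """Map scores below the threshold to 0 and stretch the rest linearly onto 0-1."""
    if score is None:
        return None  # no mouse KO data: keep null
    if score < threshold:
        return 0.0
    return (score - threshold) / (1.0 - threshold)


for value in (0.30, 0.60, 0.80, 1.00):
    print(value, rescale(value))
```

A score exactly at the threshold maps to 0 and a score of 1 maps to 1, so the transformed values still span the full colour scale.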

jdhayhurst commented 8 months ago

Changes to ETL done and run locally. @Juanmaria-rr I've updated the files in the google bucket gs://open-targets-pre-data-releases/jhpis/output/etl/parquet/targetPrioritisation - please let me know if they look correct. Thanks!

Juanmaria-rr commented 8 months ago

I checked the values of the rescaled score and they look good. This is the new distribution of the mouse phenotype scores: