opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Target object iteration 7 review #1537

Closed ktsirigos closed 3 years ago

ktsirigos commented 3 years ago

The objective is to give @JarrodBaker feedback on potential issues in the latest target iteration and to help identify the remaining tasks.

d0choa commented 3 years ago

Pasting my findings here as I notice things worth following up:

>>> new.withColumn("trac", F.explode("tractability")).select("approvedSymbol","trac.*").show(5)
+--------------+--------------------+--------+-----+
|approvedSymbol|                  id|modality|value|
+--------------+--------------------+--------+-----+
|         SCYL3|       Approved Drug|      SM|false|
|         SCYL3|   Advanced Clinical|      SM|false|
|         SCYL3|    Phase 1 Clinical|      SM|false|
|         SCYL3|Structure with Li...|      SM|false|
|         SCYL3| High-Quality Ligand|      SM|false|
+--------------+--------------------+--------+-----+
only showing top 5 rows

>>> new.withColumn("trac", F.explode("tractability")).select("approvedSymbol","trac.*").groupBy("value").count().show()
+-----+------+
|value| count|
+-----+------+
| true| 56272|
|false|489700|
+-----+------+
>>> new.select(F.explode("transcriptIds")).select(F.col("col.*")).select("source").distinct().show()
+-----------+
|     source|
+-----------+
|Ensembl_TRA|
+-----------+
>>> new.select("id", "geneticConstraint.*").show(5)
+---------------+--------------------+--------+---------+---------+
|             id|         constraints|upperBin|upperBin6|upperRank|
+---------------+--------------------+--------+---------+---------+
|ENSG00000000457|[[syn, 146.92, 13...|       2|        1|     3857|
|ENSG00000000971|[[syn, 222.53, 25...|       1|        0|     2667|
|ENSG00000003402|[[syn, 96.864, 84...|       0|        0|      775|
|ENSG00000006659|[[syn, 36.654, 36...|       8|        4|    15696|
|ENSG00000006744|[[syn, 180.47, 21...|       5|        3|    11220|
+---------------+--------------------+--------+---------+---------+
only showing top 5 rows

>>> new.withColumn("cons", F.explode("geneticConstraint.constraints")).select("approvedSymbol", "cons.*").show(5)
+--------------+--------------+------+---+-------+-------+-------+-------+
|approvedSymbol|constraintType|   exp|obs|     oe|oeLower|oeUpper|  score|
+--------------+--------------+------+---+-------+-------+-------+-------+
|         SCYL3|           syn|146.92|136|0.92567|  0.804|  1.067|0.70818|
|         SCYL3|           mis|386.49|332|0.85902|  0.784|  0.941|0.98492|
|         SCYL3|           lof| 34.32|  8| 0.2331|  0.136|  0.421|0.28151|
|           CFH|           syn|222.53|256| 1.1504|  1.038|  1.276|-1.7637|
|           CFH|           mis|660.11|588|0.89076|  0.832|  0.954|0.99735|
+--------------+--------------+------+---+-------+-------+-------+-------+
only showing top 5 rows

andrewhercules commented 3 years ago

I agree with @d0choa about removing the value column and only keeping rows where value == true. The underlying dataset is quite sparse and returning all rows will make the API response larger than it needs to be.
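For reference, in the counts pasted above only 56272 of 545972 exploded rows (about 10%) have value == true. A minimal sketch of the proposed trimming, in plain Python rather than Spark: the function name `keep_true_buckets` and the sample rows are hypothetical and only mirror the shape of the exploded tractability structs shown earlier, not real data.

```python
# Illustrative rows only: same keys as the exploded tractability structs above,
# but the true/false values here are made up for the example.
sample_tractability = [
    {"id": "Approved Drug", "modality": "SM", "value": False},
    {"id": "Advanced Clinical", "modality": "SM", "value": False},
    {"id": "High-Quality Ligand", "modality": "SM", "value": True},
]


def keep_true_buckets(rows):
    """Keep rows whose value is True and drop the now-redundant value key."""
    return [
        {key: val for key, val in row.items() if key != "value"}
        for row in rows
        if row["value"]
    ]


trimmed = keep_true_buckets(sample_tractability)
```

With this shape, the presence of a bucket in the array is what signals that it applies to the target.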

JarrodBaker commented 3 years ago

The only argument I can see for adding a source to transcriptIds is if we expect a high likelihood of further sources being added in the future. Overall, the flatter the structure, the easier it is to work with, so I'd vote for ditching it for now (using an array as David suggested) and adding it back later should it become necessary.
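A rough sketch of what the flattening would mean for one target, in plain Python. It assumes each transcriptIds entry is a struct with an `id` key next to `source`; only the `source` field is visible in the output above, so the `id` key, the function name and the placeholder identifiers are assumptions.

```python
# Placeholder entries: "id" is an assumed field name, and the identifiers are
# not real transcript IDs.
nested_transcript_ids = [
    {"id": "TRANSCRIPT_1", "source": "Ensembl_TRA"},
    {"id": "TRANSCRIPT_2", "source": "Ensembl_TRA"},
]


def flatten_transcript_ids(entries):
    """Turn an array of {id, source} structs into a plain array of IDs."""
    return [entry["id"] for entry in entries]


flat_transcript_ids = flatten_transcript_ids(nested_transcript_ids)
```

Since the only source currently present is Ensembl_TRA, no information is lost by dropping the field.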

JarrodBaker commented 3 years ago

I've updated the Target step with the following changes:

The issue with TEP was a bad input file and is now fixed.

In relation to GO we will create a separate index like HPO. I've created a ticket which I'll polish this week.

As far as I know the safety liabilities structure is as requested from the data team, so I haven't made any changes there.

The outputs can be found at gs://ot-team/jarrod/target-outputs/v8.

I'm now moving on to updating the API to serve the new schema and identifying which of the downstream ETL steps require updating (hint: basically all of them).

If there is no more feedback on this ticket I'll close it on Friday.

DSuveges commented 3 years ago

My comments on the changes:

I have one question/comment about alternative identifiers: some genes are seemingly split into separate target entries with no link between them, e.g. CCL4L2. If we look up this symbol, we get the following entries:

+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|id             |alternativeGenes                  |approvedName                        |approvedSymbol|genomicLocation                             |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|ENSG00000275313|[ENSG00000276125, ENSG00000282604]|C-C motif chemokine ligand 4 like 2 |CCL4L2        |[CHR_HSCHR17_10_CTG4, 36314484, 36312669, 1]|
|ENSG00000276070|null                              |C-C motif chemokine ligand 4 like 2 |CCL4L2        |[17, 36212878, 36210924, 1]                 |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+

I think this gene is probably split into two entries because the genes on the alternative locations have reviewed protein products (as in ticket #1381); however, I don't know why there's no link between them. The alternativeGenes field is set for the first entry and lists two alternative gene identifiers, but all three are alternatives of the fourth gene ID (ENSG00000276070), as the Ensembl search results also show. Also, when looking at targets with alternativeGenes, the location is always a scaffold, not a canonical chromosome. I think it would make sense to link all the "alternative" gene IDs to the gene on the canonical chromosome, as we support target search by Ensembl gene identifier. (This might cause confusion in the search, though, when the same gene symbol is listed twice without any further information provided, but such cases already exist, e.g. for U6, U2, etc.)

d0choa commented 3 years ago

My understanding is the same as @DSuveges's: these 2 entries should indeed be 1 entry in which all the others become alternativeGenes of ENSG00000276070. Sorry, I might not have been very clear in my previous specifications.

  1. All genes in 1-23, X, Y, MT stay
  2. For the ones in other assemblies:
    1. Discard them if they don't encode a UniProt Swiss-Prot ("reviewed") protein
    2. If they encode a reviewed protein:
      1. If they have the same approvedSymbol as an already accepted target, include them as alternativeGenes of the first one
      2. If they have a novel approvedSymbol:
        1. Take the longest gene as the reference entry
        2. Include all the rest as alternativeGenes

How does that sound, @JarrodBaker and @DSuveges?
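To make the rules above concrete, here is a minimal plain-Python sketch, not the ETL implementation. The `Gene` and `Target` classes, their field names and `build_targets` are hypothetical; gene length and the "reviewed protein" flag are assumed to be available per gene, and the canonical set follows the list above as written.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Canonical locations as written in rule 1 above (1-23, X, Y, MT).
CANONICAL_CHROMOSOMES = {str(n) for n in range(1, 24)} | {"X", "Y", "MT"}


@dataclass
class Gene:  # hypothetical input record
    id: str
    symbol: str
    chromosome: str
    length: int
    reviewed_protein: bool


@dataclass
class Target:  # hypothetical output record
    id: str
    symbol: str
    alternative_genes: List[str] = field(default_factory=list)


def build_targets(genes: List[Gene]) -> List[Target]:
    targets: List[Target] = []
    first_by_symbol: Dict[str, Target] = {}

    # Rule 1: every gene on a canonical chromosome stays as its own target.
    for gene in genes:
        if gene.chromosome in CANONICAL_CHROMOSOMES:
            target = Target(gene.id, gene.symbol)
            targets.append(target)
            first_by_symbol.setdefault(gene.symbol, target)

    # Rule 2: genes located on other assemblies.
    novel: Dict[str, List[Gene]] = {}
    for gene in genes:
        if gene.chromosome in CANONICAL_CHROMOSOMES:
            continue
        if not gene.reviewed_protein:
            continue  # 2.1: no reviewed protein, discard
        if gene.symbol in first_by_symbol:
            # 2.2.1: symbol already accepted, attach to the first such target
            first_by_symbol[gene.symbol].alternative_genes.append(gene.id)
        else:
            novel.setdefault(gene.symbol, []).append(gene)

    # 2.2.2: novel symbols, longest gene becomes the reference entry.
    for symbol, group in novel.items():
        reference = max(group, key=lambda g: g.length)
        others = [g.id for g in group if g.id != reference.id]
        targets.append(Target(reference.id, symbol, others))

    return targets
```

Note that when a symbol appears on several canonical entries (as with U6 or U2), this sketch attaches the alternative genes only to the first one, which is one possible reading of "the first one" in rule 2.2.1.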

DSuveges commented 3 years ago

@d0choa sounds about right. I see only one caveat: if a gene on an alternative location has an approved protein product, that protein product might get separate annotations that potentially conflict with those of the gene located on the canonical chromosomes (2.ii.a). But it would surely have a marginal impact.

I see no problem with adding all gene IDs on alternative assemblies to alternativeGenes if the approvedSymbol matches a gene on the canonical chromosomes.

ireneisdoomed commented 3 years ago

The target safety schema has a new version (iter 4):

  1. safetyLiabilities shall be a separate dataset from the target object, so in addition to the evidence object, we shall specify the id of the target.
  2. effects is no longer an aggregation of effect direction and dosing. For each event, we want to show how the target is modulated in an array of objects.
  3. If effects.dosing or effects.direction are not specified in the source, this field will not be present. This is the case for example for Bowes et al. (2012), where we never have dosing information.
    • For the adverse effects dataset we did not take into account these fields could be unspecified, therefore they were not captured.
    • The events for Bowes et al. (2012) are not captured, probably as a consequence of the last point. The ones we are capturing are the ones for Lynch et al. (2017) for the same gene (cell E62 and I62). Spreadsheet here.
      
      (v6
      .filter(col("id") == "ENSG00000133019")
      .filter(col("evidence.datasource") == "Bowes et al. (2012)")
      .select("evidence.event").distinct()
      .show(100,truncate=False))

+------------------------------------------+
|event                                     |
+------------------------------------------+
|decreased pupil diameter                  |
|liver failure                             |
|centrilobular liver congestion            |
|muscle cramps                             |
|increased pupil diameter                  |
|lacrimation                               |
|bronchospasm                              |
|blurred vision                            |
|abdominal cramps                          |
|irritability                              |
|bronchorrhea                              |
|acidosis                                  |
|bronchoconstriction                       |
|decreased intestinal transit              |
|ptosis                                    |
|constipation                              |
|exhaustion                                |
|decreased gastric emptying                |
|increased salivation                      |
|diarrhea                                  |
|increased respiratory rate                |
|decreased salivation                      |
|increased/decreased blood pressure        |
|pleural effusions                         |
|increased body temperature                |
|dry mouth                                 |
|frothing                                  |
|increased/decreased heart rate            |
|neutrophil collection within the sinusoids|
|sweating                                  |
|cough                                     |
+------------------------------------------+

4. New array of structs `assay`. For Tox21 we can have several assay results supporting the same event.
5. New dataset for the Tox21 source. The current input file is pulled by PIS. The new file, which contains the fields that were missing from the data in the last iteration, is located in this bucket: `otar001-core/ExperimentalToxicity/ToxCast_data-2020-05-21.tsv`. Note that this file currently only collects Tox21 data; we will review the other experimental toxicity data source (eTOX) in a similar way.
6. We are losing associations that were present in the production index for Bowes et al. (2012) (23), HeCaToS (25) and Urban et al. (2012) (25). This is probably related to the issue mentioned above. Some examples: 

Bowes et al. (2012)

(v6
.filter(col("id") == "ENSG00000181072")
.filter(col("evidence.datasource") == "Bowes et al. (2012)")
.select("evidence")
.show(100,truncate=False))
--> None

Urban et al. (2012)

(v6
.filter(col("id") == "ENSG00000180210")
.filter(col("evidence.datasource") == "Urban et al. (2012)")
.select("evidence")
.show(100,truncate=False))
--> None

HeCaToS

(v6
.filter(col("id") == "ENSG00000006638")
.filter(col("evidence.datasource") == "HeCaToS")
.select("evidence")
.show(100,truncate=False))
--> None
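One way to quantify the lost associations per data source is a set difference between the production index and the new output. The sketch below is plain Python under stated assumptions: `lost_associations` is a hypothetical helper, each association is represented as a (target id, datasource) pair, and the target IDs are placeholders rather than real data.

```python
from collections import Counter


def lost_associations(production, candidate):
    """Count, per datasource, the (target id, datasource) pairs that are
    present in production but missing from the candidate output."""
    missing = set(production) - set(candidate)
    return Counter(datasource for _target_id, datasource in missing)


# Placeholder pairs for illustration only.
production_pairs = {
    ("TARGET_1", "Bowes et al. (2012)"),
    ("TARGET_2", "Bowes et al. (2012)"),
    ("TARGET_3", "HeCaToS"),
}
candidate_pairs = {("TARGET_1", "Bowes et al. (2012)")}

lost = lost_associations(production_pairs, candidate_pairs)
```

In Spark the same comparison could be expressed as an anti-join on the two columns, but the set version is enough to show the idea.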



Please share any comments, doubts, etc.

ireneisdoomed commented 3 years ago

Another point: in Lynch et al. (2017), "Developmental Toxicity" is described as an additional dosing case besides acute and chronic.

Although developmental toxicity assays can occur over a range of doses, this is a particular scenario of this data source, so we won't make any distinction and will record these cases in the dosing field.

Therefore, effects.dosing will collect 3 different possibilities: "chronic", "acute", and "Developmental toxicity".
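As a rough illustration of how a single effects entry could be built under these rules, here is a plain-Python sketch. The helper `make_effect` and the dict representation are assumptions, and the direction value used in the example is a placeholder, since the thread does not list the allowed directions.

```python
from typing import Dict, Optional

# The three dosing values listed above.
ALLOWED_DOSING = {"chronic", "acute", "Developmental toxicity"}


def make_effect(direction: Optional[str] = None,
                dosing: Optional[str] = None) -> Dict[str, str]:
    """Build one effects entry, omitting any field the source does not specify."""
    if dosing is not None and dosing not in ALLOWED_DOSING:
        raise ValueError(f"unexpected dosing value: {dosing!r}")
    effect: Dict[str, str] = {}
    if direction is not None:
        effect["direction"] = direction
    if dosing is not None:
        effect["dosing"] = dosing
    return effect


# A source without dosing information (as described for Bowes et al. (2012))
# yields an entry with no dosing key at all, rather than a null.
no_dosing = make_effect(direction="PLACEHOLDER_DIRECTION")
with_dosing = make_effect(direction="PLACEHOLDER_DIRECTION", dosing="acute")
```

Omitting the key, instead of storing a null, matches point 3 of the schema notes: the field is simply not present when the source does not specify it.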