ktsirigos commented 3 years ago

Objective is to be able to provide feedback to @JarrodBaker on potential issues on the latest target iteration and also on diagnosing remaining tasks.

d0choa commented 3 years ago

Pasting my findings here as I notice things worth following up:

[x] Less TEPs?? (Example: DHTKD1)

>>> new.filter(F.col("tep").isNotNull()).select("id", "tep.*").count()
30
>>> old.filter(F.col("tep").isNotNull()).select("id", "tep.*").count()
39
>>>

[x] In tractability, I would only keep true values and remove boolean column. @andrewhercules and @DSuveges please feel free to object

>>> new.withColumn("trac", F.explode("tractability")).select("approvedSymbol","trac.*").show(5)
+--------------+--------------------+--------+-----+
|approvedSymbol|                  id|modality|value|
+--------------+--------------------+--------+-----+
|         SCYL3|       Approved Drug|      SM|false|
|         SCYL3|   Advanced Clinical|      SM|false|
|         SCYL3|    Phase 1 Clinical|      SM|false|
|         SCYL3|Structure with Li...|      SM|false|
|         SCYL3| High-Quality Ligand|      SM|false|
+--------------+--------------------+--------+-----+
only showing top 5 rows

>>> new.withColumn("trac", F.explode("tractability")).select("approvedSymbol","trac.*").groupBy("value").count().show()
+-----+------+
|value| count|
+-----+------+
| true| 56272|
|false|489700|
+-----+------+

[x] all transcriptIds have source Ensembl_TRA. Since it's unlikely we have any other source for transcriptIds, I would suggest to remove the source column and flatten the object to an array of strings. The only advantage I see in this structure is that it's the same as proteinIds but I don't think this level of consistency is required.

>>> new.select(F.explode("transcriptIds")).select(F.col("col.*")).select("source").distinct().show()
+-----------+
|     source|
+-----------+
|Ensembl_TRA|
+-----------+

[x] I suspect safetyLiabilities is still WIP. Let me know if you need more feedback at some point @ireneisdoomed et al.

>>> new.filter(F.col("safetyLiabilities").isNotNull()).withColumn("safety", F.explode("safetyLiabilities")).select("id", "safety.event", "safety.eventId", "safety.datasource").filter(F.col("event").isNull()).show(10, truncate = False)
+---------------+-----+-------+--------------------+
|id             |event|eventId|datasource          |
+---------------+-----+-------+--------------------+
|ENSG00000067900|null |null   |Tox21               |
|ENSG00000067900|null |null   |Force et al. (2011) |
|ENSG00000067900|null |null   |HeCaToS             |
|ENSG00000067900|null |null   |Lamore et al. (2017)|
|ENSG00000077782|null |null   |Tox21               |
|ENSG00000102468|null |null   |Tox21               |
|ENSG00000110931|null |null   |Lamore et al. (2017)|
|ENSG00000117020|null |null   |Force et al. (2011) |
|ENSG00000117020|null |null   |HeCaToS             |
|ENSG00000117020|null |null   |Tox21               |
+---------------+-----+-------+--------------------+
only showing top 10 rows

This is just a thought for @DSuveges, no action required. The protein fusions in hallmarks have the gene symbol of the fused genes. It could be useful to map them to our targets at some point, but we should think about the context in which this could be useful.
[x] In go it's all looking great. However, we will need GO term label eventually for the FE. There are 2 ways. We either do as we did with hpo and create a separate dataset/index that resolves the entity in the BE, or alternatively, repeat the labels in the target object. I'm more keen for the first option, but happy to listen other thoughts.
[x] In geneticConstraint I think we need to modify the structure. The fields upperBin upperBin6 upperRank are currently at the root level. However, according to their definition (e.g. oe_lof_upper_bin) they refer to the lof constraintType. I believe it's better to include these fields in the nested structure and make it null for the other constraintType. Happy to discuss

>>> new.select("id", "geneticConstraint.*").show(5)
+---------------+--------------------+--------+---------+---------+
|             id|         constraints|upperBin|upperBin6|upperRank|
+---------------+--------------------+--------+---------+---------+
|ENSG00000000457|[[syn, 146.92, 13...|       2|        1|     3857|
|ENSG00000000971|[[syn, 222.53, 25...|       1|        0|     2667|
|ENSG00000003402|[[syn, 96.864, 84...|       0|        0|      775|
|ENSG00000006659|[[syn, 36.654, 36...|       8|        4|    15696|
|ENSG00000006744|[[syn, 180.47, 21...|       5|        3|    11220|
+---------------+--------------------+--------+---------+---------+
only showing top 5 rows

>>> new.withColumn("cons", F.explode("geneticConstraint.constraints")).select("approvedSymbol", "cons.*").show(5)
+--------------+--------------+------+---+-------+-------+-------+-------+
|approvedSymbol|constraintType|   exp|obs|     oe|oeLower|oeUpper|  score|
+--------------+--------------+------+---+-------+-------+-------+-------+
|         SCYL3|           syn|146.92|136|0.92567|  0.804|  1.067|0.70818|
|         SCYL3|           mis|386.49|332|0.85902|  0.784|  0.941|0.98492|
|         SCYL3|           lof| 34.32|  8| 0.2331|  0.136|  0.421|0.28151|
|           CFH|           syn|222.53|256| 1.1504|  1.038|  1.276|-1.7637|
|           CFH|           mis|660.11|588|0.89076|  0.832|  0.954|0.99735|
+--------------+--------------+------+---+-------+-------+-------+-------+
only showing top 5 rows

andrewhercules commented 3 years ago

I agree with @d0choa about removing the value column and only keeping rows where value == true. The underlying dataset is quite sparse and returning all rows will make the API response larger than it needs to be.

JarrodBaker commented 3 years ago

The only argument I can see for adding a source to transcriptIds is if we expect there is a high likelihood of further sources being added in the future. Overall the flatter the structure the easier it is to work with so I'd vote for ditching it for now (using an array as David suggested) and add it back in later should it become necessary.

JarrodBaker commented 3 years ago

I've updated the Target step with the following changes:

flattened geneticConstraint so that fields upperBin, upperBin6 and upperRank are at the same level as the other fields and have values when constraintType is lof and is otherwise null.
updated transcriptId so that there is no longer a listed source. If we wish to revert this in future revert commit 2d9a8730447653fe6ee87b4e010bdef6ae65da9a
tractability no longer has a value field I threw away all the false ones.

The issue with TEP was a bad input file and is now fixed.

In relation to GO we will create a separate index like HPO. I've created a ticket which I'll polish this week.

As far as I know the safety liabilities structure is as requested from the data team, so I haven't made any changes there.

The outputs can be found at gs://ot-team/jarrod/target-outputs/v8.

I'm now moving onto updating the API to serve the new schema, and identifying which of the downstream steps in the ETL require updating (hint ~ basically all of them)

If there is no more feedback on this ticket I'll close it on Friday.

DSuveges commented 3 years ago

My comments on the changes:

Removing transcriptId.source is a good idea: I think it is absolutely unlikely we would import transcripts from anywhere else but Ensembl.
On Hallmarks fusion proteins mentioned by @d0choa: yes, absolutely we should reference them. That would allow users the traverse across our target pages. I want to check two things before that: 1) do all fusion partners identified with their approved symbol 2) is the hallmark annotation symetric: if protein A has a hallmark annotation and is fused with protein B, the same hallmark annotation is available for protein B as well.

I have one question/comment about alternative identifiers: some genes are seemingly split into separate target entries with no link between them. Eg. CCL4L2. If we look up this symbol we have the following entires:

+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|id             |alternativeGenes                  |approvedName                        |approvedSymbol|genomicLocation                             |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|ENSG00000275313|[ENSG00000276125, ENSG00000282604]|C-C motif chemokine ligand 4 like 2 |CCL4L2        |[CHR_HSCHR17_10_CTG4, 36314484, 36312669, 1]|
|ENSG00000276070|null                              |C-C motif chemokine ligand 4 like 2 |CCL4L2        |[17, 36212878, 36210924, 1]                 |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+

I think this gene is probably split into two entries because the genes on the alternative locations have reviewed protein products (as on ticket #1381 ), however I don't know why there's no link between them: the alternativeGene field is set for the first one, listing two alternative gene identifiers, however all three are alternatives for the fourth (ENSG00000276070) gene ID (as also shown on the Ensembl search results). Also when looking at targets with alternativeGenes, the location is always a scaffold, not a canonical chromosome. I think it would make sense to somehow link all the "alternative" gene IDs to the gene on the canonical chromosome as we are supporting target search by ensembl gene identifier. (This might cause confusion on the search though when the same gene symbol is listed twice without an further information provided, but such cases already exist eg. for U6, U2 etc)

d0choa commented 3 years ago

My understanding is the same as @DSuveges. That these 2 entries should indeed be 1 entry in which all the others become alternativeGenes of ENSG00000276070. Sorry because I might not have been very clear in my previous specifications.

All genes in 1-23, X, Y, MT stay
For the ones in other assemblies:
1. Discard if they don't encode for a Uniprot swissprot ("reviewed") protein
2. If they encode for a reviewed protein
  1. if they have the same approvedSymbol as an already accepted target include them as alternativeGenes of the first one
  2. if they contain a novel approvedSymbol
    1. Take the longest gene as reference entry
    2. Include all the rest as alternativeGenes

How does it sound @JarrodBaker and @DSuveges ?

DSuveges commented 3 years ago

@d0choa sounds about right, I only see one caveat though: if a gene on an alternative location has approved protein product, that protein product might get separate annotations potentially conflicting with the one located on the canonical chromosomes (2.ii.a). But it would surely have marginal impact.

I see no problem if all gene IDs on alternative assemblies would be added to alternativeGenes if the approvedSymbol matches with a gene on canonical chromosomes.

ireneisdoomed commented 3 years ago

The target safety schema has a new version (iter 4):

safetyLiabilities shall be a separate dataset from the target object, so in addition to the evidence object, we shall specify the id of the target.
effects is no longer an aggregation of effect direction and dosing. For each event, we want to show how the target is modulated in an array of objects.
If effects.dosing or effects.direction are not specified in the source, this field will not be present. This is the case for example for Bowes et al. (2012), where we never have dosing information.
- For the adverse effects dataset we did not take into account these fields could be unspecified, therefore they were not captured.
- The events for Bowes et al. (2012) are not captured, probably as a consequence of the last point. The ones we are capturing are the ones for Lynch et al. (2017) for the same gene (cell E62 and I62). Spreadsheet here.
```
(v6
.filter(col("id") == "ENSG00000133019")
.filter(col("evidence.datasource") == "Bowes et al. (2012)")
.select("evidence.event").distinct()
.show(100,truncate=False))
```

4. New array of structs `assay`. For Tox21 we can have several assay results supporting the same event.
5. New dataset for the Tox21 source. The current input file is pulled by PIS. The new file containing the new fields that were missing in the data for the last iter is located in this bucket: `otar001-core/ExperimentalToxicity/ToxCast_data-2020-05-21.tsv`. Note that this file at the moment only collects Tox21 data, we will be review the other experimental toxicity data source (eTOX) in a similar way.
6. We are losing associations that were present in the production index for Bowes et al. (2012) (23), HeCaToS (25) and Urban et al. (2012) (25). This is probably related to the issue mentioned above. Some examples:

Bowes et al. (2012)

(v6 .filter(col("id") == "ENSG00000181072") .filter(col("evidence.datasource") == "Bowes et al. (2012)") .select("evidence") .show(100,truncate=False)) --> None

Urban et al. (2012)

(v6 .filter(col("id") == "ENSG00000180210") .filter(col("evidence.datasource") == "Urban et al. (2012)") .select("evidence") .show(100,truncate=False)) --> None

Hecatos

(v6 .filter(col("id") == "ENSG00000006638") .filter(col("evidence.datasource") == "HeCaToS") .select("evidence") .show(100,truncate=False)) --> None



Please share any comments, doubts, etc.

ireneisdoomed commented 3 years ago

Another point, in Lynch et al. (2017) "Developmental Toxicity" is described as an additional case in terms of dosage besides acute and chronic.

Although developmental toxicity assays can occur over a range of doses, this is a particular scenario of this data source, so we won't do any distinction and reflect these cases in the dosing field.

Therefore, effects.dosing will collect 3 different possibilities: "chronic", "acute", and "Developmental toxicity".

opentargets / issues

Target object iteration 7 review #1537

Bowes et al. (2012)

Urban et al. (2012)

Hecatos