Closed ktsirigos closed 3 years ago
Pasting my findings here as I notice things worth following up:
[x] Less TEPs?? (Example: DHTKD1)
>>> new.filter(F.col("tep").isNotNull()).select("id", "tep.*").count()
30
>>> old.filter(F.col("tep").isNotNull()).select("id", "tep.*").count()
39
>>>
[x] In tractability, I would only keep true
values and remove boolean column. @andrewhercules and @DSuveges please feel free to object
>>> new.withColumn("trac", F.explode("tractability")).select("approvedSymbol","trac.*").show(5)
+--------------+--------------------+--------+-----+
|approvedSymbol| id|modality|value|
+--------------+--------------------+--------+-----+
| SCYL3| Approved Drug| SM|false|
| SCYL3| Advanced Clinical| SM|false|
| SCYL3| Phase 1 Clinical| SM|false|
| SCYL3|Structure with Li...| SM|false|
| SCYL3| High-Quality Ligand| SM|false|
+--------------+--------------------+--------+-----+
only showing top 5 rows
>>> new.withColumn("trac", F.explode("tractability")).select("approvedSymbol","trac.*").groupBy("value").count().show()
+-----+------+
|value| count|
+-----+------+
| true| 56272|
|false|489700|
+-----+------+
transcriptIds
, I would suggest to remove the source
column and flatten the object to an array of strings. The only advantage I see in this structure is that it's the same as proteinIds
but I don't think this level of consistency is required.>>> new.select(F.explode("transcriptIds")).select(F.col("col.*")).select("source").distinct().show()
+-----------+
| source|
+-----------+
|Ensembl_TRA|
+-----------+
[x] I suspect safetyLiabilities
is still WIP. Let me know if you need more feedback at some point @ireneisdoomed et al.
>>> new.filter(F.col("safetyLiabilities").isNotNull()).withColumn("safety", F.explode("safetyLiabilities")).select("id", "safety.event", "safety.eventId", "safety.datasource").filter(F.col("event").isNull()).show(10, truncate = False)
+---------------+-----+-------+--------------------+
|id |event|eventId|datasource |
+---------------+-----+-------+--------------------+
|ENSG00000067900|null |null |Tox21 |
|ENSG00000067900|null |null |Force et al. (2011) |
|ENSG00000067900|null |null |HeCaToS |
|ENSG00000067900|null |null |Lamore et al. (2017)|
|ENSG00000077782|null |null |Tox21 |
|ENSG00000102468|null |null |Tox21 |
|ENSG00000110931|null |null |Lamore et al. (2017)|
|ENSG00000117020|null |null |Force et al. (2011) |
|ENSG00000117020|null |null |HeCaToS |
|ENSG00000117020|null |null |Tox21 |
+---------------+-----+-------+--------------------+
only showing top 10 rows
This is just a thought for @DSuveges, no action required. The protein fusions in hallmarks have the gene symbol of the fused genes. It could be useful to map them to our targets at some point, but we should think about the context in which this could be useful.
[x] In go
it's all looking great. However, we will need GO term label eventually for the FE. There are 2 ways. We either do as we did with hpo
and create a separate dataset/index that resolves the entity in the BE, or alternatively, repeat the labels in the target object. I'm more keen for the first option, but happy to listen other thoughts.
[x] In geneticConstraint
I think we need to modify the structure. The fields upperBin
upperBin6
upperRank
are currently at the root level. However, according to their definition (e.g. oe_lof_upper_bin
) they refer to the lof
constraintType
. I believe it's better to include these fields in the nested structure and make it null
for the other constraintType
. Happy to discuss
>>> new.select("id", "geneticConstraint.*").show(5)
+---------------+--------------------+--------+---------+---------+
| id| constraints|upperBin|upperBin6|upperRank|
+---------------+--------------------+--------+---------+---------+
|ENSG00000000457|[[syn, 146.92, 13...| 2| 1| 3857|
|ENSG00000000971|[[syn, 222.53, 25...| 1| 0| 2667|
|ENSG00000003402|[[syn, 96.864, 84...| 0| 0| 775|
|ENSG00000006659|[[syn, 36.654, 36...| 8| 4| 15696|
|ENSG00000006744|[[syn, 180.47, 21...| 5| 3| 11220|
+---------------+--------------------+--------+---------+---------+
only showing top 5 rows
>>> new.withColumn("cons", F.explode("geneticConstraint.constraints")).select("approvedSymbol", "cons.*").show(5)
+--------------+--------------+------+---+-------+-------+-------+-------+
|approvedSymbol|constraintType| exp|obs| oe|oeLower|oeUpper| score|
+--------------+--------------+------+---+-------+-------+-------+-------+
| SCYL3| syn|146.92|136|0.92567| 0.804| 1.067|0.70818|
| SCYL3| mis|386.49|332|0.85902| 0.784| 0.941|0.98492|
| SCYL3| lof| 34.32| 8| 0.2331| 0.136| 0.421|0.28151|
| CFH| syn|222.53|256| 1.1504| 1.038| 1.276|-1.7637|
| CFH| mis|660.11|588|0.89076| 0.832| 0.954|0.99735|
+--------------+--------------+------+---+-------+-------+-------+-------+
only showing top 5 rows
I agree with @d0choa about removing the value
column and only keeping rows where value
== true
. The underlying dataset is quite sparse and returning all rows will make the API response larger than it needs to be.
The only argument I can see for adding a source to transcriptIds
is if we expect there is a high likelihood of further sources being added in the future. Overall the flatter the structure the easier it is to work with so I'd vote for ditching it for now (using an array as David suggested) and add it back in later should it become necessary.
I've updated the Target step with the following changes:
geneticConstraint
so that fields upperBin
, upperBin6
and upperRank
are at the same level as the other fields and have values when constraintType
is lof
and is otherwise null
. transcriptId
so that there is no longer a listed source. If we wish to revert this in future revert commit 2d9a8730447653fe6ee87b4e010bdef6ae65da9a
value
field I threw away all the false
ones.The issue with TEP was a bad input file and is now fixed.
In relation to GO we will create a separate index like HPO. I've created a ticket which I'll polish this week.
As far as I know the safety liabilities structure is as requested from the data team, so I haven't made any changes there.
The outputs can be found at gs://ot-team/jarrod/target-outputs/v8
.
I'm now moving onto updating the API to serve the new schema, and identifying which of the downstream steps in the ETL require updating (hint ~ basically all of them)
If there is no more feedback on this ticket I'll close it on Friday.
My comments on the changes:
transcriptId.source
is a good idea: I think it is absolutely unlikely we would import transcripts from anywhere else but Ensembl.I have one question/comment about alternative identifiers: some genes are seemingly split into separate target entries with no link between them. Eg. CCL4L2. If we look up this symbol we have the following entires:
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|id |alternativeGenes |approvedName |approvedSymbol|genomicLocation |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|ENSG00000275313|[ENSG00000276125, ENSG00000282604]|C-C motif chemokine ligand 4 like 2 |CCL4L2 |[CHR_HSCHR17_10_CTG4, 36314484, 36312669, 1]|
|ENSG00000276070|null |C-C motif chemokine ligand 4 like 2 |CCL4L2 |[17, 36212878, 36210924, 1] |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
I think this gene is probably split into two entries because the genes on the alternative locations have reviewed protein products (as on ticket #1381 ), however I don't know why there's no link between them: the alternativeGene
field is set for the first one, listing two alternative gene identifiers, however all three are alternatives for the fourth (ENSG00000276070
) gene ID (as also shown on the Ensembl search results). Also when looking at targets with alternativeGenes
, the location is always a scaffold, not a canonical chromosome. I think it would make sense to somehow link all the "alternative" gene IDs to the gene on the canonical chromosome as we are supporting target search by ensembl gene identifier. (This might cause confusion on the search though when the same gene symbol is listed twice without an further information provided, but such cases already exist eg. for U6, U2 etc)
My understanding is the same as @DSuveges. That these 2 entries should indeed be 1 entry in which all the others become alternativeGenes
of ENSG00000276070
. Sorry because I might not have been very clear in my previous specifications.
alternativeGenes
of the first oneapprovedSymbol
alternativeGenes
How does it sound @JarrodBaker and @DSuveges ?
@d0choa sounds about right, I only see one caveat though: if a gene on an alternative location has approved protein product, that protein product might get separate annotations potentially conflicting with the one located on the canonical chromosomes (2.ii.a). But it would surely have marginal impact.
I see no problem if all gene IDs on alternative assemblies would be added to alternativeGenes
if the approvedSymbol
matches with a gene on canonical chromosomes.
The target safety schema has a new version (iter 4):
id
of the target.effects
is no longer an aggregation of effect direction and dosing. For each event, we want to show how the target is modulated in an array of objects. effects.dosing
or effects.direction
are not specified in the source, this field will not be present. This is the case for example for Bowes et al. (2012), where we never have dosing information.
(v6
.filter(col("id") == "ENSG00000133019")
.filter(col("evidence.datasource") == "Bowes et al. (2012)")
.select("evidence.event").distinct()
.show(100,truncate=False))
+------------------------------------------+ |event | +------------------------------------------+ |decreased pupil diameter | |liver failure | |centrilobular liver congestion | |muscle cramps | |increased pupil diameter | |lacrimation | |bronchospasm | |blurred vision | |abdominal cramps | |irritability | |bronchorrhea | |acidosis | |bronchoconstriction | |decreased intestinal transit | |ptosis | |constipation | |exhaustion | |decreased gastric emptying | |increased salivation | |diarrhea | |increased respiratory rate | |decreased salivation | |increased/decreased blood pressure | |pleural effusions | |increased body temperature | |dry mouth | |frothing | |increased/decreased heart rate | |neutrophil collection within the sinusoids| |sweating | |cough | +------------------------------------------+
4. New array of structs `assay`. For Tox21 we can have several assay results supporting the same event.
5. New dataset for the Tox21 source. The current input file is pulled by PIS. The new file containing the new fields that were missing in the data for the last iter is located in this bucket: `otar001-core/ExperimentalToxicity/ToxCast_data-2020-05-21.tsv`. Note that this file at the moment only collects Tox21 data, we will be review the other experimental toxicity data source (eTOX) in a similar way.
6. We are losing associations that were present in the production index for Bowes et al. (2012) (23), HeCaToS (25) and Urban et al. (2012) (25). This is probably related to the issue mentioned above. Some examples:
(v6 .filter(col("id") == "ENSG00000181072") .filter(col("evidence.datasource") == "Bowes et al. (2012)") .select("evidence") .show(100,truncate=False)) --> None
(v6 .filter(col("id") == "ENSG00000180210") .filter(col("evidence.datasource") == "Urban et al. (2012)") .select("evidence") .show(100,truncate=False)) --> None
(v6 .filter(col("id") == "ENSG00000006638") .filter(col("evidence.datasource") == "HeCaToS") .select("evidence") .show(100,truncate=False)) --> None
Please share any comments, doubts, etc.
Another point, in Lynch et al. (2017)
"Developmental Toxicity" is described as an additional case in terms of dosage besides acute and chronic.
Although developmental toxicity assays can occur over a range of doses, this is a particular scenario of this data source, so we won't do any distinction and reflect these cases in the dosing
field.
Therefore, effects.dosing
will collect 3 different possibilities: "chronic", "acute", and "Developmental toxicity".
Objective is to be able to provide feedback to @JarrodBaker on potential issues on the latest target iteration and also on diagnosing remaining tasks.