tskir closed this issue 3 years ago.
The way the mousePhenotypes dataset is currently generated is by using the output of the soon-to-be-deprecated data_pipeline, as follows:
cat 21.06_gene-data.json | jq -r '{"id":.id,"phenotypes": [.mouse_phenotypes[]?] }|@json' > mouse_phenotypes.json
Both the output of the old target step and the mouse_phenotypes.json file are available here: gs://open-targets-data-releases/21.06/input/datapipeline-dump/
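For reference, a plain-Python equivalent of that jq one-liner (assuming 21.06_gene-data.json is newline-delimited JSON with one gene object per line; the function name is mine, not pipeline code):

```python
import json

def extract_mouse_phenotypes(in_path, out_path):
    """Equivalent of the jq one-liner: keep only id and mouse_phenotypes."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            gene = json.loads(line)
            record = {
                "id": gene.get("id"),
                # `[.mouse_phenotypes[]?]` yields [] when the field is absent
                "phenotypes": gene.get("mouse_phenotypes") or [],
            }
            fout.write(json.dumps(record) + "\n")
```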
Given that this dataset has not been updated since its conception and in the context of the target rewrite, we want to prioritise updating both the schema and the data.
Current schema:
root
|-- id: string (nullable = true)
|-- phenotypes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- mouse_gene_id: string (nullable = true)
| | |-- mouse_gene_symbol: string (nullable = true)
| | |-- phenotypes: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- category_mp_identifier: string (nullable = true)
| | | | |-- category_mp_label: string (nullable = true)
| | | | |-- genotype_phenotype: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- mp_identifier: string (nullable = true)
| | | | | | |-- mp_label: string (nullable = true)
| | | | | | |-- pmid: string (nullable = true)
| | | | | | |-- subject_allelic_composition: string (nullable = true)
| | | | | | |-- subject_background: string (nullable = true)
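To make the depth of this schema concrete, here is a small sketch (a hypothetical helper, assuming a gene object shaped exactly like the schema above) that flattens one record into one row per genotype-phenotype association:

```python
def flatten_gene(gene):
    """Flatten one mousePhenotypes record into one row per
    genotype-phenotype association. Field names follow the printed schema."""
    rows = []
    for mouse_gene in gene.get("phenotypes") or []:
        for pheno in mouse_gene.get("phenotypes") or []:
            for gp in pheno.get("genotype_phenotype") or []:
                rows.append({
                    "id": gene["id"],
                    "mouse_gene_id": mouse_gene.get("mouse_gene_id"),
                    "mouse_gene_symbol": mouse_gene.get("mouse_gene_symbol"),
                    "category_mp_identifier": pheno.get("category_mp_identifier"),
                    "category_mp_label": pheno.get("category_mp_label"),
                    **gp,  # mp_identifier, mp_label, pmid, subject_* fields
                })
    return rows
```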
A related ticket to perhaps explain the provenance of this dataset: https://github.com/opentargets/platform/issues/379
I compared the existing mouse phenotype target object with a corresponding evidence object, because they are very similar and this could bring about some insights. The comparison spreadsheet is here, and the discussions to be resolved are listed below.
Observed discrepancies:
- mouse_gene_id in target uses MGI gene identifiers, e.g. MGI:87859.
- targetInModelId in evidence uses Ensembl mouse identifiers, e.g. ENSMUSG00000026842 (in fact the schema does not allow for any other format).
Additional considerations:
Possible solutions:
- category_mp_identifier (e.g. MP:0005385) and category_mp_label (e.g. cardiovascular system phenotype).
- resourceScore, diseaseFromSource, diseaseFromSourceId, diseaseModelAssociatedHumanPhenotypes.
- biologicalModelId: currently only present in evidence, needs to be included in the target as well (one of the points in https://github.com/opentargets/platform/issues/1642).
- pmid: currently only present in target, needs to be included in the evidence as well.

While investigating model changes for the target/evidence objects, I managed to reconstruct the entire workflow which is currently in place. I provide it here for discussion and for general provenance.
The processing starts in platform-input-support which downloads two flat files: https://github.com/opentargets/platform-input-support/blob/3021d903d84e72053bad444e7447e2b978d007a7/config.yaml#L24-L29
- HMD_HumanPhenotype.rpt provides human gene to mouse gene mapping. It contains the columns:
- MGI_PhenoGenoMP.rpt provides mouse gene to mouse phenotype mapping, and also the PubMed IDs supporting the association. It contains the columns:
These two files were then picked up by the mousephenotypes module in the now-archived data_pipeline repository and processed to generate the corresponding part of the target object.
Finally, as described by Irene, the portion of the target object is extracted as a separate mousePhenotypes dataset:
cat 21.06_gene-data.json | jq -r '{"id":.id,"phenotypes": [.mouse_phenotypes[]?] }|@json' > mouse_phenotypes.json
If this workflow, as reconstructed, was indeed run in full for each release, then the mousePhenotypes data is not obsolete, because the files in http://www.informatics.jax.org/downloads/reports/ are periodically updated according to the timestamps.
I have looked through all fields of all tables of the PhenoDigm SOLR data, and can confirm that the PubMed IDs are not contained anywhere. This is presumably because the focus of the PhenoDigm data is on human disease mapping, and not specifically on mouse gene to phenotype evidence.
This means that the PubMed IDs would need to be ingested from the MGI_PhenoGenoMP.rpt file, just as they were ingested in the old workflow.
However, we can't just reuse the existing parsing approach. This is because some entries in that file combine information for multiple models/genes, for example:
These cases are numerous (77612 / 351566 ≈ 22%) and, unfortunately, not properly handled by the current parsing approach. To complicate things, the count and order of alleles, gene names, and gene IDs are not always consistent within a record, so this would require a careful investigation and ingestion approach.
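A minimal, hypothetical sketch of the kind of defensive parsing this would need (the exact multi-gene encoding in MGI_PhenoGenoMP.rpt is assumed here to be comma-separated, which this thread does not confirm): split the symbol and ID fields and refuse to pair them when the counts disagree, rather than silently mis-pairing.

```python
def split_multi_gene(symbols_field, ids_field):
    """Split comma-separated gene symbol / MGI ID fields and pair them up.

    Returns None when the counts disagree, so inconsistent records can be
    routed to manual inspection instead of being parsed incorrectly."""
    symbols = [s.strip() for s in symbols_field.split(",")]
    ids = [i.strip() for i in ids_field.split(",")]
    if len(symbols) != len(ids):
        return None  # inconsistent record: needs careful handling
    return list(zip(symbols, ids))
```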
The top-level MP phenotype categories aren't present anywhere in the flat files or the PhenoDigm data. The existing approach discovered the phenotype category for each term by parsing and walking through the entire MP ontology using the now-deprecated ontologyutils module.
The suggested alternative is to use the pronto library, the same one already used in OnToma, to ingest MP and do a similar lookup. This should be relatively straightforward.
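As a sketch of how that lookup could work: pronto can parse MP and expose each term's direct superclasses, and the category lookup then reduces to walking ancestors and keeping those that are direct children of the MP root (MP:0000001). The `parents` map below is a toy stand-in for what pronto would provide (e.g. built from Term.superclasses(distance=1)); this is a hypothetical helper, not OnToma code.

```python
MP_ROOT = "MP:0000001"

def top_level_categories(term_id, parents):
    """Return the high-level MP categories for a term: those ancestors
    (including the term itself) that are direct children of the MP root.
    `parents` maps each term ID to its direct superclass IDs."""
    categories, seen, stack = set(), set(), [term_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent == MP_ROOT:
                categories.add(current)  # current is a top-level category
            elif parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return categories
```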
- mousePhenotypes schema: circulate to backend/frontend teams so that they can start work without waiting for full implementation.
- mousePhenotypes data generation in the main PhenoDigm script.

A first iteration of the mouse phenotypes has been ingested in the infrastructure, and FE changes have been scoped in #1639.
There is a fundamental problem that is kind of bothering me: how useful is it to provide the user with (in the most extreme case) 1828 different mouse phenotype entries? Do we have any way to sort them? Can we summarise the information in any meaningful way? Something to think about.
Targets with the largest number of entries:
>>> mp.groupBy("targetFromSourceId").count().sort(F.col("count").desc()).show(10)
+------------------+-----+
|targetFromSourceId|count|
+------------------+-----+
| ENSG00000157404| 1828|
| ENSG00000141510| 1573|
| ENSG00000116678| 1298|
| ENSG00000068078| 1261|
| ENSG00000049130| 1187|
| ENSG00000066468| 998|
| ENSG00000187098| 984|
| ENSG00000077782| 906|
| ENSG00000160789| 899|
| ENSG00000206573| 891|
+------------------+-----+
only showing top 10 rows
Targets with the largest number of distinct phenotypes:
>>> mp.select("targetFromSourceId", "modelPhenotypeId").distinct().groupBy("targetFromSourceId").count().sort(F.col("count").desc()).show(10)
+------------------+-----+
|targetFromSourceId|count|
+------------------+-----+
| ENSG00000206573| 280|
| ENSG00000066468| 277|
| ENSG00000116678| 274|
| ENSG00000141510| 244|
| ENSG00000160789| 241|
| ENSG00000157404| 222|
| ENSG00000164867| 220|
| ENSG00000174697| 217|
| ENSG00000054598| 209|
| ENSG00000232810| 207|
+------------------+-----+
only showing top 10 rows
After discussion with @andrewhercules and @ireneisdoomed, we agreed there are some differences with the phenodigm evidence that might require an extra level of aggregation in the schema.
We want to make a couple of suggestions to modify the current schema:
root
|-- biologicalModelAllelicComposition: string (nullable = true)
|-- biologicalModelGeneticBackground: string (nullable = true)
|-- biologicalModelId: string (nullable = true)
|-- literature: array (nullable = true)
| |-- element: string (containsNull = true)
|-- modelPhenotypeClassId: string (nullable = true)
|-- modelPhenotypeClassLabel: string (nullable = true)
|-- modelPhenotypeId: string (nullable = true)
|-- modelPhenotypeLabel: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
|-- targetInModel: string (nullable = true)
|-- targetInModelEnsemblId: string (nullable = true)
|-- targetInModelMgiId: string (nullable = true)
Next, the suggested schema (I hope it's clear, I edited it manually):
root
|-- biologicalModels: array of structs
| |-- allelicComposition: string (nullable = true)
| |-- geneticBackground: string (nullable = true)
| |-- id: string (nullable = true)
| |-- literature: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- targetInModel: string (nullable = true)
| |-- targetInModelId: string (nullable = true)
| |-- targetInModelMgiId: string (nullable = true)
|-- modelPhenotypeClasses: array of structs
| |-- id: string (nullable = true)
| |-- label: string (nullable = true)
|-- modelPhenotypeId: string (nullable = true)
|-- modelPhenotypeLabel: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
This schema would imply a clear reduction in the number of entries. In the case of KIT (ENSG00000157404), we would go down from 1828 to 222 entries, in favour of some nested information.
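To illustrate the aggregation behind that reduction, here is a plain-Python sketch (not the pipeline code, which would presumably do this in Spark with groupBy and collect_list): group the flat rows by (targetFromSourceId, modelPhenotypeId) and collect the model-specific fields into a biologicalModels array, as in the suggested schema.

```python
from collections import defaultdict

def nest_by_phenotype(flat_rows):
    """Group flat mousePhenotypes rows into one entry per
    (targetFromSourceId, modelPhenotypeId), nesting the model-specific
    fields into a biologicalModels array."""
    grouped = defaultdict(list)
    for row in flat_rows:
        grouped[(row["targetFromSourceId"], row["modelPhenotypeId"])].append(row)
    nested = []
    for (target, phenotype_id), rows in grouped.items():
        nested.append({
            "targetFromSourceId": target,
            "modelPhenotypeId": phenotype_id,
            "modelPhenotypeLabel": rows[0]["modelPhenotypeLabel"],
            "biologicalModels": [
                {
                    "allelicComposition": r["biologicalModelAllelicComposition"],
                    "geneticBackground": r["biologicalModelGeneticBackground"],
                    "id": r["biologicalModelId"],
                    "literature": r.get("literature") or [],
                }
                for r in rows
            ],
        })
    return nested
```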
@tskir please have a look and let us know what your thoughts are.
@d0choa, the targetInModel, targetInModelId, and targetInModelMgiId fields are within the biologicalModels array. I think that would cause extra duplication that could be reduced by moving them up a level to be a child element of the mousePhenotypes array.
In order to create a table that looks like the proposed design specification, those fields should be outside of the array so that we can display them in their own columns with relevant links or drawer components. That would change the schema to something like the following:
query targetInfo {
target(ensemblId: "ENSG00000145777") {
id
approvedSymbol
mousePhenotypes{
biologicalModels{
allelicComposition
geneticBackground
id
literature
}
modelPhenotypeClasses{
id
label
}
modelPhenotypeId
modelPhenotypeLabel
targetFromSourceId
targetInModel
targetInModelId
targetInModelMgiId
}
}
}
What do you think? Does that work data-wise?
I also noted that literature should still stay in the biologicalModels array, because the same phenotype and phenotype category pairing might have different models with different literature references — see below for an example:
Yes, absolutely. It's an error in my schema.
I've updated the schema to prevent confusion (your query also has an error, because it contains 2 publications):
root
|-- biologicalModels: array of structs
| |-- allelicComposition: string (nullable = true)
| |-- geneticBackground: string (nullable = true)
| |-- id: string (nullable = true)
| |-- literature: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- targetInModel: string (nullable = true)
|-- targetInModelId: string (nullable = true)
|-- targetInModelMgiId: string (nullable = true)
|-- modelPhenotypeClasses: array of structs
| |-- id: string (nullable = true)
| |-- label: string (nullable = true)
|-- modelPhenotypeId: string (nullable = true)
|-- modelPhenotypeLabel: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
Thank you @andrewhercules @d0choa. I can confirm that conceptually the schema posted in the latest comment by @d0choa makes sense.
The only caveat we have to consider is that, by giving up a completely flat schema, we make certain queries more complicated. This applies both to front end implementation and to direct queries by the users.
For example, two sensible queries I can think of are “filter by a certain phenotype class” and “filter by certain model characteristics (allelic composition, genetic background)”. They are trivial with the flat schema, but somewhat more difficult with the nested one. And at least the first query (the phenotype class one) is very useful and will definitely have to be implemented on the platform website.
Given all of this, I wonder if an alternative solution would be for the schema to remain flat and easy for querying, and for the front end to do the necessary aggregations for representation purposes only?
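To make the trade-off concrete, here is a hypothetical sketch of the “filter by phenotype class” query against the nested schema: instead of a simple equality on a flat column, the predicate has to scan the modelPhenotypeClasses array of each record (in Spark this would be something like an exists/array predicate rather than a plain filter).

```python
def filter_by_phenotype_class(records, class_id):
    """Keep nested-schema records whose modelPhenotypeClasses array
    contains the given class ID. With the flat schema this would be a
    simple equality filter on modelPhenotypeClassId."""
    return [
        rec for rec in records
        if any(cls["id"] == class_id
               for cls in rec.get("modelPhenotypeClasses", []))
    ]
```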
However, assuming we want to proceed with changing the schema anyway, some additional comments & questions:
One issue: the current lookup adds the MP:0000001 class (root) to every record, exploding the number of rows. Once this is resolved, we're back to the old number of records.
- Investigations, part 1
- Investigations, part 2
- Evidence JSON schema changes
- mousePhenotypes model changes
- Evidence generation changes
- mousePhenotypes generation
- Schema discussions
- Reviews and changes, first iteration
- Review and changes, second iteration