tskir closed this issue 3 years ago.
The way the mousePhenotypes dataset is currently generated is by using the output of the soon-to-be-deprecated data_pipeline, as follows:
cat 21.06_gene-data.json | jq -r '{"id":.id,"phenotypes": [.mouse_phenotypes[]?] }|@json' > mouse_phenotypes.json
Both the output of the old target step and the mouse_phenotypes.json file are available here: gs://open-targets-data-releases/21.06/input/datapipeline-dump/
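For reference, a plain-Python equivalent of that jq one-liner (assuming 21.06_gene-data.json is newline-delimited JSON with one gene object per line; the function name is mine, not pipeline code):

```python
import json

def extract_mouse_phenotypes(in_path, out_path):
    """Equivalent of the jq one-liner: keep only id and mouse_phenotypes."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            gene = json.loads(line)
            record = {
                "id": gene.get("id"),
                # `[.mouse_phenotypes[]?]` yields [] when the field is absent
                "phenotypes": gene.get("mouse_phenotypes") or [],
            }
            fout.write(json.dumps(record) + "\n")
```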
Given that this dataset has not been updated since its conception and in the context of the target rewrite, we want to prioritise updating both the schema and the data.
Current schema:
root
|-- id: string (nullable = true)
|-- phenotypes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- mouse_gene_id: string (nullable = true)
| | |-- mouse_gene_symbol: string (nullable = true)
| | |-- phenotypes: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- category_mp_identifier: string (nullable = true)
| | | | |-- category_mp_label: string (nullable = true)
| | | | |-- genotype_phenotype: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- mp_identifier: string (nullable = true)
| | | | | | |-- mp_label: string (nullable = true)
| | | | | | |-- pmid: string (nullable = true)
| | | | | | |-- subject_allelic_composition: string (nullable = true)
| | | | | | |-- subject_background: string (nullable = true)
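To make the depth of this schema concrete, here is a small sketch (a hypothetical helper, assuming a gene object shaped exactly like the schema above) that flattens one record into one row per genotype-phenotype association:

```python
def flatten_gene(gene):
    """Flatten one mousePhenotypes record into one row per
    genotype-phenotype association. Field names follow the printed schema."""
    rows = []
    for mouse_gene in gene.get("phenotypes") or []:
        for pheno in mouse_gene.get("phenotypes") or []:
            for gp in pheno.get("genotype_phenotype") or []:
                rows.append({
                    "id": gene["id"],
                    "mouse_gene_id": mouse_gene.get("mouse_gene_id"),
                    "mouse_gene_symbol": mouse_gene.get("mouse_gene_symbol"),
                    "category_mp_identifier": pheno.get("category_mp_identifier"),
                    "category_mp_label": pheno.get("category_mp_label"),
                    **gp,  # mp_identifier, mp_label, pmid, subject_* fields
                })
    return rows
```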
A related ticket to perhaps explain the provenance of this dataset: https://github.com/opentargets/platform/issues/379
I compared the existing mouse phenotype target object with a corresponding evidence object, because they are very similar and this could bring about some insights. The comparison spreadsheet is here, and the discussions to be resolved are listed below.
Observed discrepancies:
- mouse_gene_id in target uses MGI gene identifiers, e.g. MGI:87859.
- targetInModelId in evidence uses Ensembl mouse identifiers, e.g. ENSMUSG00000026842 (in fact the schema does not allow for any other format).
Additional considerations:
Possible solutions:
- category_mp_identifier (e.g. MP:0005385) and category_mp_label (e.g. cardiovascular system phenotype).
- resourceScore, diseaseFromSource, diseaseFromSourceId, diseaseModelAssociatedHumanPhenotypes.
- biologicalModelId: currently only present in evidence, needs to be included in the target as well (one of the points in https://github.com/opentargets/platform/issues/1642).
- pmid: currently only present in target, needs to be included in the evidence as well.

While investigating model changes for the target/evidence objects, I managed to reconstruct the entire workflow which is currently in place. I provide it here for discussion and for general provenance.
The processing starts in platform-input-support which downloads two flat files: https://github.com/opentargets/platform-input-support/blob/3021d903d84e72053bad444e7447e2b978d007a7/config.yaml#L24-L29
- HMD_HumanPhenotype.rpt provides human gene to mouse gene mapping. It contains the columns:
- MGI_PhenoGenoMP.rpt provides mouse gene to mouse phenotype mapping, and also the PubMed IDs supporting the association. It contains the columns:
These two files were then picked up by the mousephenotypes module in the now-archived data_pipeline repository and processed to generate the corresponding part of the target object.
Finally, as described by Irene, the portion of the target object is extracted as a separate mousePhenotypes dataset:
cat 21.06_gene-data.json | jq -r '{"id":.id,"phenotypes": [.mouse_phenotypes[]?] }|@json' > mouse_phenotypes.json
If this workflow, as reconstructed, was indeed run in full for each release, then the mousePhenotypes data is not obsolete, because the files in http://www.informatics.jax.org/downloads/reports/ are periodically updated according to the timestamps.
I have looked through all fields of all tables of the PhenoDigm SOLR data, and can confirm that the PubMed IDs are not contained anywhere. This is presumably because the focus of the PhenoDigm data is on human disease mapping, and not specifically on mouse gene to phenotype evidence.
This means that the PubMed IDs would need to be ingested from the MGI_PhenoGenoMP.rpt file, just as they were ingested in the old workflow.
However, we can't just reuse the existing parsing approach. This is because some entries in that file combine information for multiple models/genes, for example:
These cases are numerous (77612 / 351566 ≈ 22%) and, unfortunately, not properly handled by the current parsing approach. To complicate things, the count and order of alleles, gene names, and gene IDs are not always consistent within a record, so this would require a careful investigation and ingestion approach.
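A minimal, hypothetical sketch of the kind of defensive parsing this would need (the exact multi-gene encoding in MGI_PhenoGenoMP.rpt is assumed here to be comma-separated, which this thread does not confirm): split the symbol and ID fields and refuse to pair them when the counts disagree, rather than silently mis-pairing.

```python
def split_multi_gene(symbols_field, ids_field):
    """Split comma-separated gene symbol / MGI ID fields and pair them up.

    Returns None when the counts disagree, so inconsistent records can be
    routed to manual inspection instead of being parsed incorrectly."""
    symbols = [s.strip() for s in symbols_field.split(",")]
    ids = [i.strip() for i in ids_field.split(",")]
    if len(symbols) != len(ids):
        return None  # inconsistent record: needs careful handling
    return list(zip(symbols, ids))
```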
The top-level MP phenotype categories aren't present anywhere in the flat files or the PhenoDigm data. The existing approach discovered the phenotype category for each term by parsing and walking through the entire MP ontology using the now-deprecated ontologyutils module.
The suggested alternative is to use the pronto library, the same one already used in OnToma, to ingest MP and do a similar lookup. This should be relatively straightforward.
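As a sketch of how that lookup could work: pronto can parse MP and expose each term's direct superclasses, and the category lookup then reduces to walking ancestors and keeping those that are direct children of the MP root (MP:0000001). The `parents` map below is a toy stand-in for what pronto would provide (e.g. built from Term.superclasses(distance=1)); this is a hypothetical helper, not OnToma code.

```python
MP_ROOT = "MP:0000001"

def top_level_categories(term_id, parents):
    """Return the high-level MP categories for a term: those ancestors
    (including the term itself) that are direct children of the MP root.
    `parents` maps each term ID to its direct superclass IDs."""
    categories, seen, stack = set(), set(), [term_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent == MP_ROOT:
                categories.add(current)  # current is a top-level category
            elif parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return categories
```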
- mousePhenotypes schema: circulate to backend/frontend teams so that they can start work without waiting for full implementation.
- mousePhenotypes data generation in the main PhenoDigm script.

A first iteration of the mouse phenotypes has been ingested in the infrastructure, and FE changes have been scoped in #1639.
There is a fundamental problem that is kind of bothering me: how useful is it to provide the user with (in the most extreme case) 1828 different mouse phenotype entries? Do we have any way to sort them? Can we summarise the information in any meaningful way? Something to think about.
Targets with the largest number of entries:
>>> mp.groupBy("targetFromSourceId").count().sort(F.col("count").desc()).show(10)
+------------------+-----+
|targetFromSourceId|count|
+------------------+-----+
| ENSG00000157404| 1828|
| ENSG00000141510| 1573|
| ENSG00000116678| 1298|
| ENSG00000068078| 1261|
| ENSG00000049130| 1187|
| ENSG00000066468| 998|
| ENSG00000187098| 984|
| ENSG00000077782| 906|
| ENSG00000160789| 899|
| ENSG00000206573| 891|
+------------------+-----+
only showing top 10 rows
Targets with the largest number of distinct phenotypes:
>>> mp.select("targetFromSourceId", "modelPhenotypeId").distinct().groupBy("targetFromSourceId").count().sort(F.col("count").desc()).show(10)
+------------------+-----+
|targetFromSourceId|count|
+------------------+-----+
| ENSG00000206573| 280|
| ENSG00000066468| 277|
| ENSG00000116678| 274|
| ENSG00000141510| 244|
| ENSG00000160789| 241|
| ENSG00000157404| 222|
| ENSG00000164867| 220|
| ENSG00000174697| 217|
| ENSG00000054598| 209|
| ENSG00000232810| 207|
+------------------+-----+
only showing top 10 rows
After discussion with @andrewhercules and @ireneisdoomed, we agreed there are some differences with the phenodigm evidence that might require an extra level of aggregation in the schema.
We want to make a couple of suggestions to modify the current schema:
root
|-- biologicalModelAllelicComposition: string (nullable = true)
|-- biologicalModelGeneticBackground: string (nullable = true)
|-- biologicalModelId: string (nullable = true)
|-- literature: array (nullable = true)
| |-- element: string (containsNull = true)
|-- modelPhenotypeClassId: string (nullable = true)
|-- modelPhenotypeClassLabel: string (nullable = true)
|-- modelPhenotypeId: string (nullable = true)
|-- modelPhenotypeLabel: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
|-- targetInModel: string (nullable = true)
|-- targetInModelEnsemblId: string (nullable = true)
|-- targetInModelMgiId: string (nullable = true)
Next, the suggested schema (I hope it's clear, I edited it manually):
root
|-- biologicalModels: array of structs
| |-- allelicComposition: string (nullable = true)
| |-- geneticBackground: string (nullable = true)
| |-- id: string (nullable = true)
| |-- literature: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- targetInModel: string (nullable = true)
| |-- targetInModelId: string (nullable = true)
| |-- targetInModelMgiId: string (nullable = true)
|-- modelPhenotypeClasses: array of structs
| |-- id: string (nullable = true)
| |-- label: string (nullable = true)
|-- modelPhenotypeId: string (nullable = true)
|-- modelPhenotypeLabel: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
This schema would imply a clear reduction in the number of entries. In the case of KIT (ENSG00000157404), we would go down from 1828 to 222 entries, in favour of some nested information.
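To illustrate the aggregation behind that reduction, here is a plain-Python sketch (not the pipeline code, which would presumably do this in Spark with groupBy and collect_list): group the flat rows by (targetFromSourceId, modelPhenotypeId) and collect the model-specific fields into a biologicalModels array, as in the suggested schema.

```python
from collections import defaultdict

def nest_by_phenotype(flat_rows):
    """Group flat mousePhenotypes rows into one entry per
    (targetFromSourceId, modelPhenotypeId), nesting the model-specific
    fields into a biologicalModels array."""
    grouped = defaultdict(list)
    for row in flat_rows:
        grouped[(row["targetFromSourceId"], row["modelPhenotypeId"])].append(row)
    nested = []
    for (target, phenotype_id), rows in grouped.items():
        nested.append({
            "targetFromSourceId": target,
            "modelPhenotypeId": phenotype_id,
            "modelPhenotypeLabel": rows[0]["modelPhenotypeLabel"],
            "biologicalModels": [
                {
                    "allelicComposition": r["biologicalModelAllelicComposition"],
                    "geneticBackground": r["biologicalModelGeneticBackground"],
                    "id": r["biologicalModelId"],
                    "literature": r.get("literature") or [],
                }
                for r in rows
            ],
        })
    return nested
```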
@tskir please have a look and let us know what your thoughts are.
@d0choa, the targetInModel, targetInModelId, and targetInModelMgiId fields are within the biologicalModels array. I think that would cause extra duplication that could be reduced by moving them up a level to be a child element of the mousePhenotypes array.
In order to create a table that looks like the proposed design specification, those fields should be outside of the array so that we can display them in their own columns with relevant links or drawer components. That would change the schema to something like the following:
query targetInfo {
target(ensemblId: "ENSG00000145777") {
id
approvedSymbol
mousePhenotypes{
biologicalModels{
allelicComposition
geneticBackground
id
literature
}
modelPhenotypeClasses{
id
label
}
modelPhenotypeId
modelPhenotypeLabel
targetFromSourceId
targetInModel
targetInModelId
targetInModelMgiId
}
}
}
What do you think? Does that work data-wise?
I also noted that literature should still stay in the biologicalModels array, because the same phenotype and phenotype category pairing might have different models with different literature references — see below for an example:
Yes, absolutely. It's an error in my schema.
I've updated the schema to prevent confusion (your query also has an error, because it contains 2 publications):
root
|-- biologicalModels: array of structs
| |-- allelicComposition: string (nullable = true)
| |-- geneticBackground: string (nullable = true)
| |-- id: string (nullable = true)
| |-- literature: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- targetInModel: string (nullable = true)
|-- targetInModelId: string (nullable = true)
|-- targetInModelMgiId: string (nullable = true)
|-- modelPhenotypeClasses: array of structs
| |-- id: string (nullable = true)
| |-- label: string (nullable = true)
|-- modelPhenotypeId: string (nullable = true)
|-- modelPhenotypeLabel: string (nullable = true)
|-- targetFromSourceId: string (nullable = true)
Thank you @andrewhercules @d0choa. I can confirm that conceptually the schema posted in the latest comment by @d0choa makes sense.
The only caveat we have to consider is that, by giving up a completely flat schema, we make certain queries more complicated. This applies both to front end implementation and to direct queries by the users.
For example, two sensible queries I can think of are “filter by a certain phenotype class” and “filter by certain model characteristics (allelic composition, genetic background)”. They are trivial with the flat schema, but somewhat more difficult with the nested one. And at least the first query (the phenotype class one) is very useful and will definitely have to be implemented on the platform website.
Given all of this, I wonder if an alternative solution would be for the schema to remain flat and easy for querying, and for the front end to do the necessary aggregations for representation purposes only?
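To make the trade-off concrete, here is a hypothetical sketch of the “filter by phenotype class” query against the nested schema: instead of a simple equality on a flat column, the predicate has to scan the modelPhenotypeClasses array of each record (in Spark this would be something like an exists/array predicate rather than a plain filter).

```python
def filter_by_phenotype_class(records, class_id):
    """Keep nested-schema records whose modelPhenotypeClasses array
    contains the given class ID. With the flat schema this would be a
    simple equality filter on modelPhenotypeClassId."""
    return [
        rec for rec in records
        if any(cls["id"] == class_id
               for cls in rec.get("modelPhenotypeClasses", []))
    ]
```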
However, assuming we want to proceed with changing the schema anyway, some additional comments & questions:
One issue: the current lookup adds the MP:0000001 class (root) to every record, exploding the number of rows. Once this is resolved, we're back to the old number of records.
- Investigations, part 1
- Investigations, part 2
- Evidence JSON schema changes
- mousePhenotypes model changes
- Evidence generation changes
- mousePhenotypes generation
- Schema discussions
- Reviews and changes, first iteration
- Review and changes, second iteration