opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Improve the schema of the `mousePhenotypes` dataset #1471

Closed tskir closed 3 years ago

tskir commented 3 years ago

Investigations, part 1

Investigations, part 2

Evidence JSON schema changes

mousePhenotypes model changes

Evidence generation changes

mousePhenotypes generation

Schema discussions

Reviews and changes, first iteration

Review and changes, second iteration

ireneisdoomed commented 3 years ago

Context

The way the mousePhenotypes dataset is currently generated is by using the output of the soon-to-be-deprecated data_pipeline, as follows:

cat 21.06_gene-data.json | jq -r '{"id":.id,"phenotypes": [.mouse_phenotypes[]?] }|@json' > mouse_phenotypes.json

Both the output of the old target step and the mouse_phenotypes.json file are available here: gs://open-targets-data-releases/21.06/input/datapipeline-dump/

Given that this dataset has not been updated since its conception and in the context of the target rewrite, we want to prioritise updating both the schema and the data.
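
For readers without jq, the extraction command above can be sketched in Python. This is only an illustration, assuming that the gene dump holds one JSON object per line; the function name `extract_mouse_phenotypes` is hypothetical and not part of any Open Targets code.

```python
import json
from typing import Iterable, Iterator


def extract_mouse_phenotypes(lines: Iterable[str]) -> Iterator[str]:
    """Yield one compact JSON string per target, keeping only id and phenotypes."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        target = json.loads(line)
        raw = target.get("mouse_phenotypes")
        # Mirror jq's `[.mouse_phenotypes[]?]`: a missing or null field
        # becomes an empty list instead of raising an error.
        if isinstance(raw, list):
            phenotypes = list(raw)
        elif isinstance(raw, dict):
            phenotypes = list(raw.values())
        else:
            phenotypes = []
        yield json.dumps({"id": target.get("id"), "phenotypes": phenotypes})
```

Passing an open file handle for the gene dump as `lines` and writing each yielded string on its own line would give a file shaped like mouse_phenotypes.json.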

ireneisdoomed commented 3 years ago

Current schema:

root
 |-- id: string (nullable = true)
 |-- phenotypes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- mouse_gene_id: string (nullable = true)
 |    |    |-- mouse_gene_symbol: string (nullable = true)
 |    |    |-- phenotypes: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- category_mp_identifier: string (nullable = true)
 |    |    |    |    |-- category_mp_label: string (nullable = true)
 |    |    |    |    |-- genotype_phenotype: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- mp_identifier: string (nullable = true)
 |    |    |    |    |    |    |-- mp_label: string (nullable = true)
 |    |    |    |    |    |    |-- pmid: string (nullable = true)
 |    |    |    |    |    |    |-- subject_allelic_composition: string (nullable = true)
 |    |    |    |    |    |    |-- subject_background: string (nullable = true)
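
To make the three levels of nesting in this schema concrete, here is a minimal Python sketch that walks one target record and emits one flat row per gene, phenotype category, and genotype-phenotype observation. The field names are taken from the schema above; the function name `flatten_target` is hypothetical.

```python
from typing import Any, Dict, Iterator

# Leaf-level fields of each genotype_phenotype element in the schema above.
LEAF_FIELDS = (
    "mp_identifier",
    "mp_label",
    "pmid",
    "subject_allelic_composition",
    "subject_background",
)


def flatten_target(target: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (mouse gene, phenotype category, observation)."""
    for gene in target.get("phenotypes") or []:
        for category in gene.get("phenotypes") or []:
            for observation in category.get("genotype_phenotype") or []:
                row = {
                    "id": target.get("id"),
                    "mouse_gene_id": gene.get("mouse_gene_id"),
                    "mouse_gene_symbol": gene.get("mouse_gene_symbol"),
                    "category_mp_identifier": category.get("category_mp_identifier"),
                    "category_mp_label": category.get("category_mp_label"),
                }
                # Missing leaf fields become None, matching nullable = true.
                row.update({name: observation.get(name) for name in LEAF_FIELDS})
                yield row
```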

ireneisdoomed commented 3 years ago

A related ticket to perhaps explain the provenance of this dataset: https://github.com/opentargets/platform/issues/379

tskir commented 3 years ago

Data model content discussions

I compared the existing mouse phenotype target object with a corresponding evidence object, because they are very similar and this could bring about some insights. The comparison spreadsheet is here, and the discussions to be resolved are listed below.

Specifying the mouse gene ID

Observed discrepancy:

Additional considerations:

Possible solutions:

Fields specific to only one of the objects — confirm this is as planned

Fields specific to only one of the objects — changes required, confirm

tskir commented 3 years ago

Reconstruction of the existing workflow

While investigating model changes for the target/evidence objects, I managed to reconstruct the entire workflow which is currently in place. I provide it here for discussion and for general provenance.

The processing starts in platform-input-support which downloads two flat files: https://github.com/opentargets/platform-input-support/blob/3021d903d84e72053bad444e7447e2b978d007a7/config.yaml#L24-L29

HMD_HumanPhenotype.rpt provides human gene to mouse gene mapping. It contains the columns:

MGI_PhenoGenoMP.rpt provides mouse gene to mouse phenotype mapping, and also the PubMed IDs supporting the association. It contains the columns:

These two files were then picked up by the mousephenotypes module in the now-archived data_pipeline repository and processed to generate the corresponding part of the target object.

Finally, as described by Irene, the portion of the target object is extracted as a separate mousePhenotypes dataset:

cat 21.06_gene-data.json | jq -r '{"id":.id,"phenotypes": [.mouse_phenotypes[]?] }|@json' > mouse_phenotypes.json

If this workflow, as reconstructed, was indeed run in full for each release, then the mousePhenotypes data is not obsolete, because the files in http://www.informatics.jax.org/downloads/reports/ are periodically updated according to the timestamps.

tskir commented 3 years ago

Changes required for reimplementation

PubMed IDs

I have looked through all fields of all tables of the PhenoDigm SOLR data, and can confirm that the PubMed IDs are not contained anywhere. This is presumably because the focus of the PhenoDigm data is on human disease mapping, and not specifically on mouse gene to phenotype evidence.

This means that the PubMed IDs would need to be ingested from the MGI_PhenoGenoMP.rpt, just as they were ingested in the old workflow.

However, we can't just reuse the existing parsing approach. This is because some entries in that file combine information for multiple models/genes, for example:

These cases are numerous (77612 / 351566 ≈ 22%) and, unfortunately, not properly handled by the current parsing approach. To complicate things, the count and order of alleles, gene names, and gene IDs are not always consistent within a record, so it would require a careful investigation and ingestion approach.

Phenotype categories

The top level MP phenotype categories aren't present anywhere in the flat files or the PhenoDigm data. The existing approach discovered the phenotype category for each term by parsing and walking through the entire MP ontology by using the now-deprecated ontologyutils module.

The suggested alternative is to use the pronto library, the same one already used in OnToma, to ingest MP and do a similar lookup. This should be relatively straightforward.
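
The lookup itself can be sketched independently of the ontology library. The example below assumes the parent relationships have already been read from MP (for instance with pronto) into a plain dictionary mapping each term ID to its direct parents; the function name and the toy term IDs in the usage note are hypothetical. The only real identifier is `MP:0000001`, the root term of the Mammalian Phenotype ontology.

```python
from typing import Dict, List, Set

MP_ROOT = "MP:0000001"  # root term of the Mammalian Phenotype ontology


def top_level_categories(
    term_id: str, parents: Dict[str, List[str]], root: str = MP_ROOT
) -> Set[str]:
    """Return every term on the path to the root that is a direct child of it."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a term may be reachable through several parents
        seen.add(current)
        direct_parents = parents.get(current, [])
        if root in direct_parents:
            found.add(current)
        stack.extend(direct_parents)
    return found
```

With a toy map such as `{"MP:LEAF": ["MP:MID", "MP:TOP_B"], "MP:MID": ["MP:TOP_A"], "MP:TOP_A": [MP_ROOT], "MP:TOP_B": [MP_ROOT]}`, the term `MP:LEAF` resolves to both `MP:TOP_A` and `MP:TOP_B`. Because a term can have more than one parent, it can belong to more than one category, so the result is a set rather than a single value.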

Proposed sequence of changes

  1. Submit changes to the evidence JSON schema so that we can freeze it ASAP.
  2. Finalise the mousePhenotypes schema and circulate to backend/frontend teams so that they can start work without waiting for full implementation.
  3. Implement construction of the mousePhenotypes data in the main PhenoDigm script.
  4. Add PMIDs to the evidence data.

d0choa commented 3 years ago

A first iteration of the mouse phenotypes has been ingested in the infrastructure, and the FE changes have been scoped in #1639.

There is a fundamental problem that is bothering me: how useful is it to provide the user with (in the most extreme case) 1828 different mouse phenotype entries? Do we have any way to sort them? Can we summarise the information in any meaningful way? Something to think about.

>>> mp.groupBy("targetFromSourceId").count().sort(F.col("count").desc()).show(10)
+------------------+-----+
|targetFromSourceId|count|
+------------------+-----+
|   ENSG00000157404| 1828|
|   ENSG00000141510| 1573|
|   ENSG00000116678| 1298|
|   ENSG00000068078| 1261|
|   ENSG00000049130| 1187|
|   ENSG00000066468|  998|
|   ENSG00000187098|  984|
|   ENSG00000077782|  906|
|   ENSG00000160789|  899|
|   ENSG00000206573|  891|
+------------------+-----+
only showing top 10 rows

Targets with the largest number of distinct phenotypes:

>>> mp.select("targetFromSourceId", "modelPhenotypeId").distinct().groupBy("targetFromSourceId").count().sort(F.col("count").desc()).show(10)
+------------------+-----+
|targetFromSourceId|count|
+------------------+-----+
|   ENSG00000206573|  280|
|   ENSG00000066468|  277|
|   ENSG00000116678|  274|
|   ENSG00000141510|  244|
|   ENSG00000160789|  241|
|   ENSG00000157404|  222|
|   ENSG00000164867|  220|
|   ENSG00000174697|  217|
|   ENSG00000054598|  209|
|   ENSG00000232810|  207|
+------------------+-----+
only showing top 10 rows

d0choa commented 3 years ago

After discussion with @andrewhercules and @ireneisdoomed, we agreed there are some differences with the phenodigm evidence that might require an extra level of aggregation in the schema.

We want to make a couple of suggestions to modify the current schema:

root
 |-- biologicalModelAllelicComposition: string (nullable = true)
 |-- biologicalModelGeneticBackground: string (nullable = true)
 |-- biologicalModelId: string (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- modelPhenotypeClassId: string (nullable = true)
 |-- modelPhenotypeClassLabel: string (nullable = true)
 |-- modelPhenotypeId: string (nullable = true)
 |-- modelPhenotypeLabel: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- targetInModel: string (nullable = true)
 |-- targetInModelEnsemblId: string (nullable = true)
 |-- targetInModelMgiId: string (nullable = true)

Next, the suggested schema (I hope it's clear, I edited it manually):

root
 |-- biologicalModels: array of structs
 |    |-- allelicComposition: string (nullable = true)
 |    |-- geneticBackground: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- literature: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- targetInModel: string (nullable = true)
 |    |-- targetInModelId: string (nullable = true)
 |    |-- targetInModelMgiId: string (nullable = true)
 |-- modelPhenotypeClasses: array of structs
 |    |-- id: string (nullable = true)
 |    |-- label: string (nullable = true)
 |-- modelPhenotypeId: string (nullable = true)
 |-- modelPhenotypeLabel: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)

This schema would imply a clear reduction in the number of entries. In the case of KIT (ENSG00000157404), we would go down from 1828 to 222 entries in favour of some nested information.

@tskir please have a look and let us know what your thoughts are.

andrewhercules commented 3 years ago

@d0choa, the targetInModel, targetInModelId, and targetInModelMgiId are within the biologicalModels array. I think that would cause extra duplication that could be reduced by moving them up a level to be a child element of the mousePhenotypes array.

In order to create a table that looks like the proposed design specification, those fields should be outside of the array so that we can display them in their own columns with relevant links or drawer components. That would change the schema to something like the following:

query targetInfo {
  target(ensemblId: "ENSG00000145777") {
    id
    approvedSymbol
    mousePhenotypes{
      biologicalModels{
        allelicComposition
        geneticBackground
        id
        literature
      }
      modelPhenotypeClasses{
        id
        label
      }
      modelPhenotypeId
      modelPhenotypeLabel
      targetFromSourceId
      targetInModel
      targetInModelId
      targetInModelMgiId
    }
  }
}

What do you think? Does that work data-wise?

I also noted that literature should still stay in the biologicalModels array because the same phenotype and phenotype category pairing might have different models with different literature references — see below for an example:

[Screenshot 2021-08-31 at 18 55 28]

d0choa commented 3 years ago

Yes, absolutely. It's an error in my schema.

I have updated the schema to prevent confusion (your query also has an error, because it contains 2 publications):

root
 |-- biologicalModels: array of structs
 |    |-- allelicComposition: string (nullable = true)
 |    |-- geneticBackground: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- literature: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- targetInModel: string (nullable = true)
 |-- targetInModelId: string (nullable = true)
 |-- targetInModelMgiId: string (nullable = true)
 |-- modelPhenotypeClasses: array of structs
 |    |-- id: string (nullable = true)
 |    |-- label: string (nullable = true)
 |-- modelPhenotypeId: string (nullable = true)
 |-- modelPhenotypeLabel: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
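
The aggregation from the flat schema into this nested one can be sketched in plain Python as follows. This is an illustration only: the function name `nest_rows` is hypothetical, the grouping key is a guess at what identifies one entry, and mapping the flat `targetInModelEnsemblId` field to the nested `targetInModelId` field is an assumption.

```python
from typing import Any, Dict, Iterable, List, Tuple


def nest_rows(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group flat rows into one entry per target, model gene, and phenotype."""
    grouped: Dict[Tuple[Any, Any, Any], Dict[str, Any]] = {}
    for row in rows:
        # Assumed grouping key: human target, mouse gene, phenotype term.
        key = (row["targetFromSourceId"], row["targetInModelMgiId"], row["modelPhenotypeId"])
        entry = grouped.setdefault(key, {
            "biologicalModels": [],
            "targetInModel": row["targetInModel"],
            "targetInModelId": row["targetInModelEnsemblId"],  # assumed mapping
            "targetInModelMgiId": row["targetInModelMgiId"],
            "modelPhenotypeClasses": [],
            "modelPhenotypeId": row["modelPhenotypeId"],
            "modelPhenotypeLabel": row["modelPhenotypeLabel"],
            "targetFromSourceId": row["targetFromSourceId"],
        })
        model = {
            "allelicComposition": row["biologicalModelAllelicComposition"],
            "geneticBackground": row["biologicalModelGeneticBackground"],
            "id": row["biologicalModelId"],
            "literature": list(row.get("literature") or []),
        }
        # Keep each distinct model and class only once per entry.
        if model not in entry["biologicalModels"]:
            entry["biologicalModels"].append(model)
        phenotype_class = {
            "id": row["modelPhenotypeClassId"],
            "label": row["modelPhenotypeClassLabel"],
        }
        if phenotype_class not in entry["modelPhenotypeClasses"]:
            entry["modelPhenotypeClasses"].append(phenotype_class)
    return list(grouped.values())
```

A model that appears with two different literature lists is kept as two separate elements here; merging the lists per model ID would be a further design decision.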

tskir commented 3 years ago

Thank you @andrewhercules @d0choa. I can confirm that conceptually the schema posted in the latest comment by @d0choa makes sense.

The only caveat we have to consider is that, by giving up a completely flat schema, we make certain queries more complicated. This applies both to front end implementation and to direct queries by the users.

For example, two sensible queries which I can think of are “filter by a certain phenotype class” and “filter by certain model characteristics (allelic composition, genetic background)”. They are trivial with the flat schema, but somewhat more difficult with the nested one. And at least the first query (the phenotype class one) is very useful and will definitely have to be implemented on the platform website.

Given all of this, I wonder if an alternative solution would be for the schema to remain flat and easy for querying, and for the front end to do the necessary aggregations for representation purposes only?
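
To illustrate the difference in query complexity, here is a minimal Python sketch of both filters over in-memory records. Field names follow the flat and nested schemas posted earlier in this thread; the function names are hypothetical, and a real implementation would run in the backend rather than on Python lists.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def flat_by_class(rows: List[Record], class_id: str) -> List[Record]:
    """Flat schema: a single equality test per row."""
    return [row for row in rows if row["modelPhenotypeClassId"] == class_id]


def nested_by_class(entries: List[Record], class_id: str) -> List[Record]:
    """Nested schema: look inside the modelPhenotypeClasses array."""
    return [
        entry for entry in entries
        if any(c["id"] == class_id for c in entry["modelPhenotypeClasses"])
    ]


def nested_by_background(entries: List[Record], background: str) -> List[Record]:
    """Nested schema: keep matching models only and drop entries left empty."""
    result = []
    for entry in entries:
        models = [
            m for m in entry["biologicalModels"]
            if m["geneticBackground"] == background
        ]
        if models:
            # Copy the entry so the caller's data is not modified.
            result.append({**entry, "biologicalModels": models})
    return result
```

The model filter shows the extra step most clearly: with the nested schema, the array inside each entry has to be pruned as well, otherwise non-matching models would still be displayed next to the matching ones.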

tskir commented 3 years ago

However, assuming we want to proceed with changing the schema anyway, some additional comments & questions: