mbdebian commented 1 year ago

Background

The ETL Disease step persists its analysis output according to this configuration block:

disease {
  efo-ontology {
    format = "json"
    path = ${common.input}"/ontology-inputs/ontology-efo.jsonl"
  }
  hpo-ontology {
    format = "json"
    path = ${common.input}"/ontology-inputs/ontology-hpo.jsonl"
  }
  mondo-ontology {
    format = "json"
    path = ${common.input}"/ontology-inputs/ontology-mondo.jsonl"
  }
  hpo-phenotype {
    format = "json"
    path = ${common.input}"/ontology-inputs/hpo-phenotypes.jsonl"
  }
  outputs = {
    diseases {
      format = ${common.output-format}
      path = ${common.output}"/diseases"
    }
    hpo {
      format = ${common.output-format}
      path = ${common.output}"/hpo"
    }
    disease-hpo {
      format = ${common.output-format}
      path = ${common.output}"/diseaseToPhenotype"
    }
  }
}

This issue concerns the diseases output.

Diseases output data model

Within a data (pre)/releases bucket for a particular release, this output can be found at the relative path

output/etl/{parquet,json}/diseases

The schema looks like this 👇🏻

root
 |-- id: string (nullable = true)
 |-- code: string (nullable = true)
 |-- dbXRefs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description: string (nullable = true)
 |-- name: string (nullable = true)
 |-- directLocationIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- obsoleteTerms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- parents: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- synonyms: struct (nullable = true)
 |    |-- hasBroadSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- hasExactSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- hasNarrowSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- hasRelatedSynonym: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- ancestors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- descendants: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- children: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- therapeuticAreas: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- indirectLocationIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ontology: struct (nullable = true)
 |    |-- isTherapeuticArea: boolean (nullable = true)
 |    |-- leaf: boolean (nullable = true)
 |    |-- sources: struct (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |    |-- name: string (nullable = true)

And a content sample is shown below 👇🏻

-RECORD 0-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 id                  | EFO_0000255                                                                                                                                                                                                                                                                                                    
 code                | http://www.ebi.ac.uk/efo/EFO_0000255                                                                                                                                                                                                                                                                           
 dbXRefs             | [NCIT:C7528, ICDO:9767/1, MESH:D007119, MONDO:0004977, GARD:11973, UMLS:C0020981, Orphanet:86886, ONCOTREE:AITL, ICDO:9705/3, EFO:0000255, GARD:0011973, ICD9:202.70, ICD10CM:C86.5, ICD10:C86.5, DOID:0111147, MedDRA:10002449, SCTID:413537009]                                                              
 description         | A mature T-cell non-Hodgkin lymphoma, characterized by systemic disease and a polymorphous infiltrate involving lymph nodes and extranodal sites. The clinical course is typically aggressive.                                                                                                                 
 name                | angioimmunoblastic T-cell lymphoma                                                                                                                                                                                                                                                                             
 directLocationIds   | null                                                                                                                                                                                                                                                                                                           
 obsoleteTerms       | null                                                                                                                                                                                                                                                                                                           
 parents             | [MONDO_0000430]                                                                                                                                                                                                                                                                                                
 synonyms            | {null, [angioimmunoblastic lymphadenopathy, angioimmunoblastic lymphadenopathy with Dysproteinemia, lymphogranulomatosis X, AILD, T-cell lymphoma, AILD type, angioimmunoblastic T-cell lymphoma, angioimmunoblastic lymphadenopathy type T-cell lymphoma, immunoblastic lymphadenopathy, AILT], null, [AITL]} 
 ancestors           | [MONDO_0000430, OTAR_0000018, MONDO_0002334, MONDO_0023370, MONDO_0044881, MONDO_0045024, Orphanet_322126, EFO_0002426, Orphanet_68336, EFO_0001642, EFO_0000574, MONDO_0015757, MONDO_0024615, EFO_0005952, EFO_0000508, MONDO_0019044, EFO_0005803, MONDO_0015760, EFO_0000616]                              
 descendants         | []                                                                                                                                                                                                                                                                                                             
 children            | []                                                                                                                                                                                                                                                                                                             
 therapeuticAreas    | [OTAR_0000018, MONDO_0045024, EFO_0005803]                                                                                                                                                                                                                                                                     
 indirectLocationIds | null                                                                                                                                                                                                                                                                                                           
 ontology            | {false, true, {http://www.ebi.ac.uk/efo/EFO_0000255, EFO_0000255}}                                                                                                                                                                                                                                             
only showing top 1 row

Data model refactoring proposal

Arrays of different types of synonyms to be renamed as follows
- hasBroadSynonym to broad
- hasExactSynonym to exact
- hasNarrowSynonym to narrow
- hasRelatedSynonym to related
leaf attribute in the ontology object to be renamed to isLeaf, so it matches its boolean type
sources within the ontology object to be renamed to source, as its type is struct and it will only contain a single object.
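A minimal sketch of the renames listed above, applied to a single record held as a plain Python dict (for example, one line of the JSON output). The helper names `rename_keys` and `refactor_record` are hypothetical and are not part of the ETL code; the actual step would apply the equivalent renames to its own data frames.

```python
# Hypothetical helpers: apply the proposed renames to one record (a plain dict).
SYNONYM_RENAMES = {
    "hasBroadSynonym": "broad",
    "hasExactSynonym": "exact",
    "hasNarrowSynonym": "narrow",
    "hasRelatedSynonym": "related",
}
ONTOLOGY_RENAMES = {"leaf": "isLeaf", "sources": "source"}


def rename_keys(obj, renames):
    """Return a copy of obj with its keys renamed; None is passed through."""
    if obj is None:
        return None
    return {renames.get(key, key): value for key, value in obj.items()}


def refactor_record(record):
    """Return a copy of record using the proposed field names."""
    out = dict(record)
    out["synonyms"] = rename_keys(record.get("synonyms"), SYNONYM_RENAMES)
    out["ontology"] = rename_keys(record.get("ontology"), ONTOLOGY_RENAMES)
    return out


# Trimmed-down version of the content sample shown earlier.
sample = {
    "id": "EFO_0000255",
    "synonyms": {"hasBroadSynonym": None, "hasRelatedSynonym": ["AITL"]},
    "ontology": {
        "isTherapeuticArea": False,
        "leaf": True,
        "sources": {
            "url": "http://www.ebi.ac.uk/efo/EFO_0000255",
            "name": "EFO_0000255",
        },
    },
}
refactored = refactor_record(sample)
```

The input record is left untouched, so old and new shapes can be compared side by side while the change is reviewed.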

Data content refactoring proposal

Across the output data we have a mix of '[]' and null values for representing empty lists (arrays). It would be great if we could unify these into a default '[]' value for arrays wherever they have no content. Attributes of primitive types, such as strings, would also benefit from a meaningful default value, where possible and applicable.
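To illustrate the array part of this proposal, here is a rough sketch that replaces null arrays with '[]' in one record held as a plain Python dict. The field lists are copied from the schema above; the function name `fill_empty_arrays` is hypothetical, and the sketch assumes a missing array field should be treated like a null one.

```python
# Array-typed fields taken from the schema above.
ARRAY_FIELDS = [
    "dbXRefs", "directLocationIds", "obsoleteTerms", "parents", "ancestors",
    "descendants", "children", "therapeuticAreas", "indirectLocationIds",
]
SYNONYM_FIELDS = [
    "hasBroadSynonym", "hasExactSynonym", "hasNarrowSynonym", "hasRelatedSynonym",
]


def fill_empty_arrays(record):
    """Return a copy of record where null (or missing) arrays become []."""
    out = dict(record)
    for field in ARRAY_FIELDS:
        if out.get(field) is None:
            out[field] = []
    # The synonyms struct holds arrays too; a null struct becomes all-empty.
    synonyms = dict(out.get("synonyms") or {})
    for field in SYNONYM_FIELDS:
        if synonyms.get(field) is None:
            synonyms[field] = []
    out["synonyms"] = synonyms
    return out


# Trimmed-down version of the content sample shown earlier.
record = {
    "id": "EFO_0000255",
    "directLocationIds": None,
    "parents": ["MONDO_0000430"],
    "descendants": [],
    "synonyms": {
        "hasBroadSynonym": None,
        "hasExactSynonym": ["AILT"],
        "hasNarrowSynonym": None,
        "hasRelatedSynonym": ["AITL"],
    },
}
normalised = fill_empty_arrays(record)
```

Defaults for string attributes are left out on purpose, since what counts as a meaningful default there is still open.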

ireneisdoomed commented 1 year ago

Some comments you might want to consider:

Rename id to diseaseId. This will allow us to establish relationships between datasets more easily.
Rename code to url.
- Do we even need this? Could we be resolving the ID on identifiers.org?
Rename dbXRefs to crossReferences
Remove indirectLocationIds and directLocationIds.
- Not clear to me the content of these fields. They are null for all records except 34. I don't think we use them anywhere.
Evaluate the potential redundancy between the fields parents/ancestors and children/descendants.
- To clarify, parents and children refer to the immediate nodes directly above or below a specific ID, respectively. In contrast, ancestors and descendants capture all nodes when propagating upwards or downwards from a particular ID.
- I understand the distinction, but parents/children are essentially a subset of the ancestors/descendants that we could perhaps model differently. Do we use these anywhere? Perhaps when propagating the evidence.
Agree with the renaming of the synonyms fields.
Agree with the renaming of leaf to isLeaf

mbdebian commented 1 year ago

Thanks a lot for the comments, @ireneisdoomed! I’d like to add some inline comments:

Some comments you might want to consider:

Rename id to diseaseId. This will allow us to establish relationships between datasets more easily.

That would be really helpful when it comes to using this dataset 👍🏻

Rename code to url.

Do we even need this? Could we be resolving the ID on identifiers.org?

That’s correct: if we have Compact Identifiers, we can use the identifiers.org resolution services. As EFO is an ontology, it gets special treatment on identifiers.org, and its URL could be:

https://identifiers.org/EFO:0000255
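As a sketch of how that URL could be derived instead of stored, the function below turns an underscore-separated ID into a Compact Identifier URL. The function name is hypothetical, and whether every prefix found in this dataset (MONDO, Orphanet, OTAR, ...) actually resolves on identifiers.org is an assumption that would need checking; only the EFO case above is confirmed in this thread.

```python
def compact_identifier_url(disease_id):
    """Build an identifiers.org URL from an ID such as 'EFO_0000255'.

    Assumes the ID has the form '<prefix>_<local id>'; resolution of
    prefixes other than EFO is not verified here.
    """
    prefix, separator, local_id = disease_id.partition("_")
    if not separator or not prefix or not local_id:
        raise ValueError(f"unexpected disease ID format: {disease_id!r}")
    return f"https://identifiers.org/{prefix}:{local_id}"


url = compact_identifier_url("EFO_0000255")
```

If this holds for all prefixes in use, the code field could be dropped and the URL computed by clients on demand.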

Rename dbXRefs to crossReferences

👍🏻

Remove indirectLocationIds and directLocationIds.

Not clear to me the content of these fields. They are null for all records except 34. I don't think we use them anywhere.

Evaluate the potential redundancy between the fields parents/ancestors and children/descendants.

If they don’t add value to the data, the lighter the schema the better.

To clarify, parents and children refer to the immediate nodes directly above or below a specific ID, respectively. In contrast, ancestors and descendants capture all nodes when propagating upwards or downwards from a particular ID.

I understand the distinction, but parents/children are essentially a subset of the ancestors/descendants that we could perhaps model differently. Do we use these anywhere? Perhaps when propagating the evidence.

Then it looks like ancestors and descendants are just computational sugar. We would need to make sure that rebuilding that information from parents and children across the dataset is coherent with the stored values, with no loops, etc.

If no client of this dataset needs the full upward or downward tree, I would remove this information.
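A rough sketch of that coherence check, assuming the records are available as plain Python dicts: it rebuilds every ancestor set from the parents field alone, fails on loops, and reports records whose stored ancestors differ. The function names and the toy IDs are made up for illustration; the same idea applies to children/descendants by inverting the parent map.

```python
def ancestors_from_parents(parents):
    """Map each ID to the set of all its ancestors, using only direct parents.

    parents: dict of ID -> list of direct parent IDs (None counts as empty).
    Raises ValueError if following parents ever leads back to the same ID.
    Uses recursion, so it assumes the hierarchy is not extremely deep.
    """
    resolved = {}

    def visit(node, path):
        if node in resolved:
            return resolved[node]
        if node in path:
            raise ValueError(f"loop detected at {node}")
        path.add(node)
        result = set()
        for parent in parents.get(node) or []:
            result.add(parent)
            result |= visit(parent, path)
        path.discard(node)
        resolved[node] = result
        return result

    for node in parents:
        visit(node, set())
    return resolved


def find_incoherent(records):
    """Return the IDs whose stored ancestors differ from the rebuilt ones."""
    rebuilt = ancestors_from_parents({r["id"]: r.get("parents") for r in records})
    return [
        r["id"] for r in records
        if set(r.get("ancestors") or []) != rebuilt[r["id"]]
    ]


# Toy records with made-up IDs, not taken from the real dataset.
records = [
    {"id": "D_leaf", "parents": ["D_mid"], "ancestors": ["D_mid", "D_root"]},
    {"id": "D_mid", "parents": ["D_root"], "ancestors": ["D_root"]},
    {"id": "D_root", "parents": [], "ancestors": []},
]
incoherent = find_incoherent(records)
```

An empty result on the full dataset would support dropping ancestors and descendants without losing information.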

Agree with the renaming of the synonyms fields.

👍🏻

Agree with the renaming of leaf to isLeaf

👍🏻

opentargets / issues

ETL Output - Disease step, 'diseases', data model and content refactoring proposal #3069
